AD-A221  428 


Volume II

PROCEEDINGS: 

Image Understanding Workshop


[Cover illustration: MODEL BASED IMAGE UNDERSTANDING. Application areas shown: photo interpretation, cartography, robotic vision, autonomous navigation, automated parts inspection, surveillance. The diagram matches object, 3D, texture, 2D, linear and intensity models against object, volume, surface and shape descriptions.]

DISTRIBUTION STATEMENT A
Approved for public release; distribution unlimited


Sponsored  by: 

Defense  Advanced  Research  Projects  Agency 
Information  Science  and  Technology  Office 


February  1987 




Image  Understanding  Workshop 


Proceedings  of  a  Workshop 
Held  at 

Los  Angeles,  California 

February 23-25, 1987


Volume  II 


Sponsored  by: 

Defense  Advanced  Research  Projects  Agency 
Information  Science  and  Technology  Office 




This  document  contains  copies  of  reports  prepared  for 
the  DARPA  Image  Understanding  Workshop.  Included 
are  results  from  both  the  basic  and  strategic  computing 
programs  within  DARPA/ISTO  sponsored  projects. 


APPROVED  FOR  PUBLIC  RELEASE 
DISTRIBUTION  UNLIMITED 

The views and conclusions contained in this document are those of the authors and should not be
interpreted as necessarily representing the official policies, either expressed or implied, of the Defense Advanced
Research Projects Agency or the United States Government.




Distributed by
Morgan Kaufmann Publishers, Inc.
95 First Street
Los Altos, California 94022


ISBN  0-934613-36-2 

Printed  in  the  United  States  of  America 


TABLE  OF  CONTENTS 


Page 

AUTHOR  INDEX  .  i 

VOLUME  I 

SECTION I - PROGRAM REVIEWS BY PRINCIPAL INVESTIGATORS

"Vision  in  Dynamic  Environments",  Azriel 
Rosenfeld,  Larry  S.  Davis,  John  (Yiannis) 

Aloimonos;  University  of  Maryland  .  1 

"USC  Image  Understanding  Research  -  1986", 

R.  Nevatia;  University  of  Southern 

California  .  8 

"Image  Understanding  Research  at  SRI 

International",  Martin  A.  Fischler  and 

Robert  C.  Bolles;  SRI  International  .  12 

"Image  Understanding:  Intelligent  Systems", 

Thomas O. Binford; Stanford University . 18

"Image Understanding Research at CMU",

Takeo  Kanade;  Carnegie-Mellon  University  .  32 

"MIT  Progress  in  Understanding  Images", 

T.  Poggio  and  the  Staff;  Massachusetts 
Institute  of  Technology  .  41 

"Summary  of  Progress  in  Image  Understanding 
at  the  University  of  Massachusetts",  Allen 
R.  Hanson  and  Edward  M.  Riseman;  University 
of  Massachusetts  at  Amherst  .  55 

"Recent  Progress  of  the  Rochester  Image 
Understanding  Project",  Jerome  A.  Feldman 
and  Christopher  M.  Brown;  University  of 
Rochester  .  65 

"Image  Understanding  and  Robotics  Research 
at Columbia University", John R. Kender,
Peter K. Allen, Terrance E. Boult;
Columbia University . 71


SECTION I - PROGRAM REVIEWS BY PRINCIPAL INVESTIGATORS (Continued)

"Developments  in  Knowledge-Based  Vision 
for  Obstacle  Detection  and  Avoidance", 

K.E. Olin, F.M. Vilnrotter, M.J. Daily
and K. Reiser; Hughes Research Laboratories . 78

"Detecting Obstacles in Range Imagery",
by Michael J. Daily, John G. Harris, and
Kurt Reiser; Hughes Artificial Intelligence
Center . 87

"Model-Directed  Object  Recognition  on  the 
Connection  Machine",  D.W.  Thompson  and 
J.L.  Mundy;  General  Electric  Corporate 
Research  and  Development  .  98 

"Environmental  Modeling  and  Recognition 
for  an  Autonomous  Land  Vehicle",  Daryl 
T.  Lawton,  Tod  S.  Levitt,  Christopher  C. 

McConnell,  Philip  C.  Nelson,  Jay 


Glicksman;  Advanced  Decision  Systems  .  107 

"Honeywell  Progress  on  Knowledge-Based 
Robust  Target  Recognition  &  Tracking", 

Bir  Bhanu,  Durga  Panda  and  Raj  Aggarwal; 

Honeywell  Systems  &  Research  Center  .  122 


SECTION II - TECHNICAL REPORTS PRESENTED


"The  ITA  Range  Image  Processing  System", 

David G. Morgenthaler, Keith D. Gremban,

Mitch  Nathan,  John  D.  Bradstreet;  Martin 
Marietta  Denver  Aerospace  .  127 

"Vision  and  Navigation  for  the  Carnegie 
Mellon  Navlab",  Charles  Thorpe,  Steven 
Shafer,  Takeo  Kanade  and  the  members  of 
the  Strategic  Computing  Vision  Lab, 

Carnegie-Mellon  University  .  143 



SECTION  II  -  TECHNICAL  REPORTS  PRESENTED  (Continued) 

"Vision-Based  Navigation:  A  Status 
Report",  Larry  S.  Davis,  Daniel  Dementhon, 


Ramesh  Gajulapalli,  Todd  R.  Kushner, 

Jacqueline  Le  Moigne,  Phillip  Veatch; 

University  of  Maryland  .  153 

"Information  Management  in  a  Sensor-based 

Autonomous  System",  Grahame  B.  Smith  and 

Thomas  M.  Strat;  SRI  International  .  170 


"Tools  and  Experiments  in  the  Knowledge- 
Directed  Interpretation  of  Road  Scenes", 

Bruce  A.  Draper,  Robert  T.  Collins, 

John  Brolio,  Joey  Griffith,  Allen  R.  Hanson, 

Edward M. Riseman; University of
Massachusetts at Amherst . 178

"Models  of  Errors  and  Mistakes  in  Machine 
Perception  -  Part  1.  First  Results  for 
Computer  Vision  Range  Measurements", 

Ruzena  Bajcsy,  Eric  Krotkov,  Max  Mintz; 


University of Pennsylvania . 194

"Automating  Knowledge  Acquisition  for 
Aerial  Image  Interpretation",  David  M. 

McKeown,  Jr.,  Wilson  A.  Harvey;  Carnegie- 
Mellon University . 205

"Using Generic Geometric Models for Intelligent
Shape Extraction", Pascal Fua and
Andrew  J.  Hanson;  SRI  International  .  227 

"Stereo  Correspondence:  A  Hierarchical 
Approach", Hong Seh Lim and Thomas O.

Binford;  Stanford  University  .  234 

"Symbolic  Pixel  Labeling  for  Curvilinear 
Feature  Detection",  John  Canning,  J.  John 
Kim,  Azriel  Rosenfeld;  University  of 
Maryland  .  242 


"Searching  for  Geometric  Structure  in 
Images  of  Natural  Scenes",  George  Reynolds, 

J. Ross Beveridge; University of
Massachusetts at Amherst . 257




SECTION  II  -  TECHNICAL  REPORTS  PRESENTED  (Continued) 


"Detecting  Runways  in  Aerial  Images", 

A.  Huertas,  W.  Cole  and  R.  Nevatia; 

University  of  Southern  California  .  272 

"A  Report  on  the  DARPA  Image  Understanding 
Architectures  Workshop",  Azriel  Rosenfeld; 
University  of  Maryland  .  298 


"Image  Processing  to  Geometric  Reasoning: 

Military Image Analysis at GE FESD",

M.S.  Horwedel;  General  Electric  Federal 
Electronic  Systems  Division  .  303 

"Image  Understanding  Technology  and  its 
Transition  to  Military  Applications", 

David  Y.  Tseng;  Hughes  Research  Laboratories; 


Julius F. Bogdanowicz; Electro-Optical

and  Data  Systems  Group,  Hughes  Aircraft 

Company  .  310 

"Context  Dependent  Target  Recognition", 

Teresa  M.  Silberberg;  Hughes  Artificial 
Intelligence  Center  .  313 

"Precompiling  a  Geometrical  Model  into  an 
Interpretation  Tree  for  Object  Recognition 
in  Bin-picking  Tasks",  Katsushi  Ikeuchi; 
Carnegie-Mellon  University  .  321 

"Analytical  Properties  of  Generalized 
Cylinders  and  Their  Projections", 

Jean  Ponce,  David  Chelberg,  Wallace  Mann; 

Stanford  University  . .  340 

"Surface  Segmentation  and  Description  from 
Curvature  Features",  T.J.  Fan,  G.  Medioni, 
and  R.  Nevatia;  University  of  Southern 
California  . 351 


"From  Sparse  3-D  Data  Directly  to  Volumetric 
Shape  Descriptions",  Kashipati  Rao  and 
R.  Nevatia;  University  of  Southern 
California . 360



SECTION  II  -  TECHNICAL  REPORTS  PRESENTED  (Continued) 

"Object  Recognition  Using  Alignment",  Daniel 
P. Huttenlocher, Shimon Ullman; Massachusetts
Institute of Technology . 370

"Parallel  Recognition  of  Objects  Comprised 

of  Pure  Structure",  Paul  R.  Cooper  and 

Susan  C.  Hollbach;  University  of  Rochester  ...  381 

"A  Framework  for  Implementing  Multi-Sensor 
Robotic  Tasks",  Peter  K.  Allen;  Columbia 
University . 392

"Shape  from  Darkness:  Deriving  Surface 
Information from Dynamic Shadows", John R.
Kender, Earl M. Smith; Columbia University . 399


VOLUME  II 

SECTION III - OTHER TECHNICAL REPORTS

"Vision and Visual Exploration for the
Stanford Mobile Robot", Ernst Triendl and
David J. Kriegman; Stanford University . 407

"AuRA: An Architecture for Vision-Based
Robot Navigation", Ronald C. Arkin,
Edward M. Riseman, Allen R. Hanson;
University of Massachusetts at Amherst . 417

"Guiding an Autonomous Land Vehicle
Using Knowledge-Based Landmark
Recognition", Hatem Nasr, Bir Bhanu and
Stephanie Schaffer; Honeywell Systems
and Research Center . 432

"The CMU Navigational Architecture",
Anthony Stentz, Yoshimasa Goto; Carnegie-
Mellon University . 440

"Qualitative Navigation", Tod S. Levitt,
Daryl T. Lawton, David M. Chelberg,
Philip C. Nelson; Advanced Decision
Systems . 447




SECTION  III  -  OTHER  TECHNICAL  REPORTS  (Continued) 

"Interpretation of Terrain Using
Hierarchical Symbolic Grouping for
Multi-Spectral Images", Bir Bhanu and
Peter Symosek; Honeywell Systems &
Research Center . 466

"Design of a Prototype Interactive
Cartographic Display and Analysis
Environment", Andrew J. Hanson, Alex P.
Pentland, and Lynn H. Quam; SRI
International . 475

"The Image Understanding Architecture",
Charles C. Weems, Steven P. Levitan;
University of Massachusetts at Amherst . 483

"Constructs for Cooperative Image Understanding
Environments", Christopher C.
McConnell, Philip C. Nelson, Daryl T.
Lawton; Advanced Decision Systems . 497

"Interpreting Aerial Photographs by
Segmentation and Search", David Harwood,
Susan Chang, Larry S. Davis; University of
Maryland . 507

"Initial Hypothesis Formation in Image
Understanding Using an Automatically
Generated Knowledge Base", Nancy Bonar
Lehrer, George Reynolds, Joey Griffith;
University of Massachusetts at Amherst . 521

"Goal-Directed Control of Low-Level
Processes for Image Interpretation",
Charles A. Kohl, Allen R. Hanson,
Edward M. Riseman; University of
Massachusetts at Amherst . 538

"Active  Vision",  John  (Yiannis)  Aloimonos, 

Isaac  Weiss;  University  of  Maryland;  Amit 
Bandyopadhyay;  SUNY  Stony  Brook . . .  552 



SECTION  III  -  OTHER  TECHNICAL  REPORTS  (Continued) 

"An  Integrated  System  That  Unifies  Multiple 


Shape From Texture Algorithms", Mark L.
Moerdler and John R. Kender; Columbia
University  . 574 

"DRIVE  -  Dynamic  Reasoning  from  Integrated 
Visual Evidence", Bir Bhanu and Wilhelm
Burger;  Honeywell  Systems  &  Research 
Center  . 581 

"What  Is  A  "Degenerate"  View?",  John  R. 

Kender, David G. Freudenstein; Columbia
University  .  589 


"The  Role  and  Use  of  Color  in  a 

General Vision System", Glenn Healey

and  Thomas  O.  Binford;  Stanford  University  ...  599 


"Using  a  Color  Reflection  Model  to 

Separate  Highlights  from  Object 

Color", Gudrun J. Klinker, Steven
A. Shafer, and Takeo Kanade; Carnegie-

Mellon  University  .  614 

"Recognizing Unexpected Objects: A
Proposed Approach", Azriel Rosenfeld;
University of Maryland . 620

"Parallel  Algorithms  for  Computer  Vision 
on  the  Connection  Machine",  James  J. 

Little,  Guy  Blelloch,  and  Todd  Cass; 

Massachusetts Institute of Technology . 628

"Solving  The  Depth  Interpolation  Problem 
on a Fine Grained, Mesh- and Tree-Connected
SIMD  Machine.",  Dong  J.  Choi;  Columbia 
University . 639

"Survey of Parallel Computers", Hong Seh
Lim and Thomas O. Binford; Stanford

University  .  644 



SECTION  III  -  OTHER  TECHNICAL  REPORTS  (Continued) 

"Evidence  Combination  Using  Likelihood 
Generators", David Sher; The University of


Rochester  .  655 

"Multi-Modal  Segmentation  using  Markov 
Random  Fields",  Paul  B.  Chou  and 
Christopher  M.  Brown;  The  University  of 
Rochester  .  663 

"Minimization  of  the  Quantization  Error 
in  Camera  Calibration",  Behrooz  Kamgar- 
Parsi; University of Maryland . 671

"Uncertainty Analysis of Image Measurements",
M.A. Snyder; University of
Massachusetts  at  Amherst  .  681 

"Results  of  Motion  Estimation  With  More 
Than  Two  Frames",  Hormoz  Shariat  and 
Keith  E.  Price;  University  of  Southern 
California . 694

"Tracing Finite Motions Without Correspondence",
Ken-ichi Kanatani and Tsai-
Chia  Chou;  University  of  Maryland  .  704 

"A  Unified  Perspective  on  Computational 
Techniques  for  the  Measurement  of  Visual 
Motion", P. Anandan; University of
Massachusetts at Amherst . 719

"Hierarchical  Gradient-Based  Motion 
Detection",  Frank  Glazer;  University  of 
Massachusetts at Amherst . 733

"Shape Reconstruction on a Varying Mesh",

Isaac  Weiss;  University  of  Maryland  .  749 

"Stereo  Matching  Using  Viterbi  Algorithm", 

G.V.S. Raju; Ohio University, Thomas O.

Binford;  Stanford  University,  and  S. 

Shekhar;  Stanford  University  .  766 


SECTION III - OTHER TECHNICAL REPORTS (Continued)

"Steps Toward Accurate Stereo Correspondence",
Steven D. Cochran; University
of Southern California . 777

"Stereo  Matching  by  Hierarchical,  Micro- 
canonical  Annealing",  Stephen  T.  Barnard; 

SRI  International  .  792 


"The  Formation  of  Partial  3D  Models  from 
2D Projections - An Application of Algebraic
Reasoning", D. Cyrluk, D. Kapur,

J.L.  Mundy,  and  V.  Nguyen;  General 

Electric  Corporate  Research  and  Development  ..  798 


"Spectral  and  Polarization  Stereo  Methods 

Using  a  Single  Light  Source",  Lawrence 

Brill Wolff; Columbia University . 810

"Surface Curvature and Contour from Photometric
Stereo", Lawrence Brill Wolff;

Columbia  University  .  821 

"Qualitative  Information  in  the  Optical 
Flow",  Alessandro  Verri  and  Tomaso  Poggio; 
Massachusetts  Institute  of  Technology  .  825 

"Algorithm  Synthesis  for  IU  Applications", 

Michael  R.  Lowry;  Stanford  University  .  835 


"Generalizing  Epipolar-Plane  Image  Analysis 
for  Non-Orthogonal  and  Varying  View 
Directions",  H.  Harlyn  Baker,  Robert  C. 
Bolles, and David H. Marimont; SRI
International . 843

"Detecting  Dotted  Lines  and  Curves  in 
Random-Dot  Patterns",  Richard  Vistnes; 

Stanford  University  .  849 

"Learning  Shape  Computations",  John  (Yiannis) 
Aloimonos  and  David  Shulman;  University  of 
Maryland  .  862 



SECTION  III  -  OTHER  TECHNICAL  REPORTS  (Continued) 

"Local  Shape  from  Specularity",  Glenn 

Healey and Thomas O. Binford; Stanford

University  .  874 

"Finding  Object  Boundaries  Using  Guided 

Gradient  Ascent",  Yvan  Leclerc  and  Pascal 

Fua;  SRI  International  . .  888 

"Detecting  Blobs  as  Textons  in  Natural 
Images",  Harry  Voorhees  and  Tomaso  Poggio; 
Massachusetts  Institute  of  Technology  .  892 

"Reconstruction  of  Surfaces  from  Profiles", 

Peter  Giblin  and  Richard  Weiss;  University 
of  Massachusetts  at  Amherst  .  900 

"Using  Occluding  Contours  for  Object 
Recognition",  Willie  Lim;  Massachusetts 
Institute  of  Technology  . 909 

"Parallel  Optical  Flow  Computation",  James 
Little,  Heinrich  Bulthoff  and  Tomaso  Poggio; 
Massachusetts  Institute  of  Technology  .  915 

"Using  Optimal  Algorithms  to  Test  Model 
Assumptions  in  Computer  Vision",  Terrance  E. 

Boult;  Columbia  University  .  921 

"A  Parallel  Implementation  and  Exploration 
of  Witkin’s  Shape  from  Texture  Method", 

Lisa  Gottesfeld  Brown  and  Hussein  A.  H. 

Ibrahim;  Columbia  University  .  927 

"Localized  Intersections  Computation  for 
Solid  Modelling  with  Straight  Homogeneous 
Generalized  Cylinders",  Jean  Ponce  and 
David  Chelberg;  Stanford  University  .  933 


"Line-Drawing  Interpretation:  Straight- 
Lines  and  Conic-Sections",  Vic  Nalwa; 
Stanford University . 942


SECTION III - OTHER TECHNICAL REPORTS (Continued)

"Line-Drawing Interpretation: Bilateral
Symmetry", Vic Nalwa; Stanford University . 956

"Estimation  and  Accurate  Localization  of 
Edges",  Fatih  Ulupinar  and  Gerard  Medioni; 


University  of  Southern  California  .  968 

"Edge-Detector Resolution Improvement by
Image-Interpolation", Vishvjit S. Nalwa;

Stanford  University  .  981 

"Detection,  Localization  and  Estimation 
of  Edges",  J.S.  Chen  and  G.  Medioni; 

University  of  Southern  California  .  988 


AUTHOR INDEX

NAME                    PAGE

Aggarwal, R.            122
Allen, P. K.            71, 392
Aloimonos, J. Y.        1, 552, 862
Anandan, P.             719
Arkin, R. C.            417
Bajcsy, R.              194
Baker, H. H.            843
Bandyopadhyay, A.       552
Barnard, Stephen T.     792
Beveridge, J. R.        257
Bhanu, B.               122, 466, 581
Binford, T. O.          18, 234, 599
Blelloch, G.            628
Bogdanowicz, J. F.      310
Bolles, R. C.           12, 843
Boult, T. E.            71, 921
Bradstreet, J. D.       127
Brolio, J.              178
Brown, C. M.            65, 663
Brown, L. G.            927
Bulthoff, H.            915
Burger, W.              581
Canning, J.             242
Cass, T.                628


Chang, S.               507
Chelberg, D.            340, 933
Chelberg, D. M.         447
Chen, J. S.             988
Choi, D. J.             639
Chou, P. B.             663
Chou, T.                704
Cochran, S. D.          777
Cole, W.                272
Collins, R. T.          178
Cooper, P. R.           381
Cyrluk, D.              798
Daily, M. J.            78, 87
Davis, L. S.            1, 153, 507
Dementhon, D.           153
Draper, B. A.           178
Fan, T. J.              351
Feldman, J. A.          65
Fischler, M. A.         12
Freudenstein, D. G.     589
Fua, P.                 227, 888
Gajulapalli, R.         153
Giblin, P.              900
Glazer, F.              733


Glicksman, J.           107
Goto, Y.                440
Gremban, K. D.          127
Griffith, J.            178, 521
Hanson, A. J.           227, 475
Hanson, A. R.           55, 178, 417, 538
Harris, J. G.           87
Harvey, W. A.           205
Harwood, D.             507
Healey, G.              599, 874
Hollbach, S. C.         381
Horwedel, M. S.         303
Huertas, A.             272
Huttenlocher, D. P.     370
Ibrahim, H. A. H.       927
Ikeuchi, K.             321
Kamgar-Parsi, B.        671
Kanade, T.              32, 143, 614
Kanatani, K.            704
Kapur, D.               798
Kender, J. R.           71, 399, 574, 589
Kim, J. J.              242
Klinker, G. J.          614

Kohl, C. A.             538
Kriegman, D. J.         407
Krotkov, E.             194
Kushner, T. R.          153
Lawton, D. T.           107, 447, 497
Leclerc, Y.             888
Lehrer, N. B.           521
Le Moigne, J.           153
Levitan, S. P.          483
Levitt, T. S.           107, 447
Lim, H. S.              234, 644
Lim, W.                 909
Little, J. J.           628, 915
Lowry, M. R.            835
Mann, W.                340
Marimont, D. H.         843
McConnell, C. C.        107, 497
McKeown, D. M.          205
Medioni, G.             351, 968, 988
Mintz, M.               194
Moerdler, M. L.         574
Morgenthaler, D. G.     127
Mundy, J. L.            98, 798
Nalwa, V. S.            942, 956, 981


Nasr, H.                432
Nathan, M.              127
Nelson, P. C.           107, 447, 497
Nevatia, R.             8, 272, 351, 360
Nguyen, V.              798
Olin, K. E.             78
Panda, D.               122
Pentland, A. P.         475
Poggio, T.              41, 825, 892, 915
Ponce, J.               340, 933
Price, K. E.            694
Quam, L. H.             475
Raju, G. V. S.          766
Rao, K.                 360
Reiser, K.              78, 87
Reynolds, G.            257, 521
Riseman, E. M.          55, 178, 417, 538
Rosenfeld, A.           1, 242, 298, 620
Schaffer, S.            432
Shafer, S. A.           143, 614
Shariat, H.             694
Shekhar, S.             766
Sher, D.                655
Shulman, D.             862

Silberberg, T. M.       313
Smith, E. M.            399
Smith, G. B.            170
Snyder, M. A.           681
Stentz, A.              440
Strat, T. M.            170
Symosek, P.             466
Triendl, E.             407
Tseng, D. Y.            310
Thompson, D. W.         98
Thorpe, C.              143
Ullman, S.              370
Ulupinar, F.            968
Veatch, P.              153
Verri, A.               825
Vilnrotter, F. M.       78
Vistnes, R.             849
Voorhees, H.            892
Weems, C. C.            483
Weiss, I.               552, 749
Weiss, R.               900
Wolff, L. B.            810, 821

SECTION III - OTHER TECHNICAL REPORTS


Vision  and  Visual  Exploration  for  the  Stanford  Mobile  Robot 

Ernst  Triendl  and  David  J.  Kriegman 


Artificial  Intelligence  Laboratory 
Stanford  University 
Stanford,  CA  94305 


ABSTRACT 

Soft modeling, stereo vision, motion planning,
uncertainty reduction, image processing,
and locomotion enable the Mobile Autonomous
Robot Stanford to explore a benign indoor
environment without human intervention. Details
of the first successful run are presented.


1.  Introduction 

By the end of 1986 the Mobile Autonomous Robot
Stanford (MARS) or Mobi was able to move relatively
freely under its own guidance inside our laboratory
building.

The rationale behind the vision and motion planning
system is:

1. to be fast enough to allow perceptible movement, and
even fast motion when special hardware is added in
some distant future;

2. to understand enough about a benign indoor
environment to move about and recognize the more
important elements of a building;

3. to move under its own guidance, exploring rooms
without human intervention.

These goals are achieved by a combination of modeling
and vision. Modeling tells us that buildings have walls,
doors, floors and windows that obey certain relations to
each other and the vehicle. The vision system determines
the regions of free space and location of obstacles. The
motion planning system generates moves that are likely
to increase the knowledge about free space so that further
moves will be possible.

Support for this work was provided by the Air Force Office of
Scientific Research under contract F33615-85-C-5106, Image
Understanding contract N00039-84-C-0211 and Autonomous Land Vehicle
contract AIDS-1085S-1. David Kriegman was supported by a fellowship
from the Fannie and John Hertz Foundation.

The examples in this paper are from the first
convincingly successful run on Friday, October 18, 1986. Left
to its own devices Mobi moved down the hallway, made a
tour of the lobby and came back into the hallway, traveling
a total distance of 35 meters.

We  will  now  describe  the  robot,  the  model  and  its 
implication  for  vision,  the  vision  system,  free  space  and 
motion  planning  with  uncertainty  propagation  and  finally 
show  details  of  Mobi’s  excursion. 

2.  The  Robot 

The vision, modeling and navigation system
described in this paper was implemented on an
omnidirectional vehicle. Mobi (fig. 1) is propelled by three wheels
that are mounted on the edges of an equilateral triangle.
Each wheel has passive rollers along its chord, permitting
the wheel to move sideways while being driven. This
allows simultaneous rotation and translation over relatively
smooth terrain. A motor drives each wheel and a shaft
encoder counts the number of rotations. The counts are
mapped through the kinematics of the vehicle and integrated
to determine the change in the robot’s position.
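The mapping from wheel counts to a change in position can be sketched as follows. This is only an illustration under stated assumptions, not the code that runs on Mobi: the wheel bearings (90, 210 and 330 degrees), the names `counts_to_travel`, `body_motion` and `integrate`, and all numeric parameters are hypothetical. Each wheel is assumed to roll tangentially, so its travel is s_i = -sin(a_i) dx + cos(a_i) dy + R dtheta, which has a simple closed-form inverse for three wheels spaced 120 degrees apart.

```python
import math

# Assumed wheel bearings on the vehicle body (hypothetical layout).
WHEEL_ANGLES = [math.radians(a) for a in (90.0, 210.0, 330.0)]


def counts_to_travel(counts, counts_per_rev, wheel_radius):
    """Convert encoder counts of each wheel into rolled distance."""
    return [2.0 * math.pi * wheel_radius * c / counts_per_rev for c in counts]


def body_motion(travel, base_radius):
    """Invert s_i = -sin(a_i)*dx + cos(a_i)*dy + R*dtheta for the three wheels.

    The closed form below relies on the 120 degree spacing assumed above.
    """
    dx = (2.0 / 3.0) * sum(-math.sin(a) * s for a, s in zip(WHEEL_ANGLES, travel))
    dy = (2.0 / 3.0) * sum(math.cos(a) * s for a, s in zip(WHEEL_ANGLES, travel))
    dtheta = sum(travel) / (3.0 * base_radius)
    return dx, dy, dtheta


def integrate(pose, travel, base_radius):
    """Add one small body-frame motion to a world-frame pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = body_motion(travel, base_radius)
    mid = theta + dtheta / 2.0  # use the heading halfway through the step
    x += dx * math.cos(mid) - dy * math.sin(mid)
    y += dx * math.sin(mid) + dy * math.cos(mid)
    return x, y, theta + dtheta
```

Calling `integrate` once per encoder sample accumulates the pose; equal travel on all three wheels yields pure rotation, which is what makes simultaneous rotation and translation possible.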

This robot has four sensing modalities: vision, acoustics,
tactile sensing and odometry. The stereo vision system uses
two cameras mounted on a pan/tilt head to learn the
details of its environment. The acoustic system is composed
of twelve Polaroid sensors equally spaced around the
circumference of the robot. These sensors return a distance
that is proportional to the time of flight of the echo of
an acoustic chirp which the sensor emits. Essentially, the
sensor finds the distance to the nearest object within a
30° cone. This information can either be integrated into a
map of the environment or simply used for guarded moves
against the unexpected. The tactile sensing system in
the bumpers uses segmented tape switches to determine
the point of contact in a collision. Besides being used
for emergency collision detection, tactile sensing has also
been used for moving through doorways; the edges of the
doors are too close to the robot for the vision and acoustic
systems to sense so only odometry and contact sensors are
available. More details about how the acoustic and tactile
systems are used can be found in [Kriegman].

Figure 1. MARS aka Mobi
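As a rough illustration of the acoustic ranging and its use for guarded moves, the sketch below converts echo times into distances and checks the sensors whose cones cover a heading. The speed of sound, the function names and the stop distance are assumptions introduced here; the actual use of the sensors is described in [Kriegman].

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C (assumed value)
NUM_SENSORS = 12        # twelve sensors equally spaced around the robot


def echo_distance(time_of_flight):
    """Distance to the reflector; the chirp travels out and back."""
    return SPEED_OF_SOUND * time_of_flight / 2.0


def sensor_bearings(n=NUM_SENSORS):
    """Bearing in degrees of each sensor, assuming sensor 0 faces forward."""
    return [i * 360.0 / n for i in range(n)]


def guarded_move_ok(times_of_flight, heading_deg, stop_distance,
                    half_cone_deg=15.0):
    """Return False if a sensor covering the heading sees something too close."""
    bearings = sensor_bearings(len(times_of_flight))
    for bearing, tof in zip(bearings, times_of_flight):
        # Smallest absolute angle between sensor axis and heading.
        off_axis = abs((bearing - heading_deg + 180.0) % 360.0 - 180.0)
        if off_axis <= half_cone_deg and echo_distance(tof) < stop_distance:
            return False
    return True
```

With twelve sensors and 30° cones the cones tile the full circle, so any heading is covered by at least one sensor.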

Mobi’s distributed computational system resides partly
on the vehicle and partly offboard. Onboard, each
wheel is controlled by an 8 bit microprocessor running
a simple position controller. These controllers are
commanded by a sixteen bit computer (National
Semiconductor 32016) which is also responsible for trajectory
planning, sensor data acquisition and offboard
communication. Most of the "intelligence" resides offboard,
with the onboard computer executing simple commands
and handling real time emergency situations such as
stopping when the tactile bumpers detect a crash. The robot
communicates, via a digital radio link, with a (necessarily
offboard) Symbolics Lisp Machine which is responsible for
planning and building a world model. A TV transmitter
sends the video signal from the cameras to the digitizer.
From there the image is transferred, via DMA link, to a
VAX which does image processing and computes stereo
correspondence points that are sent to the Lisp Machine
over an ethernet.


3.  Modeling  and  the  Environment 

We expect the robot to explore a benign indoor
environment without human intervention. A model is used
to represent knowledge about this environment and helps
to execute tasks within it.

The  base  model  consists  of  a  flat  floor  that  carries 
vertical,  straight  walls  with  hinged  doors.  Arrangements 
of  walls  form  rooms  and  hallways.  Walls  are  very  likely  to 
be  parallel  or  at  right  angles  with  respect  to  each  other. 

In the actual implementation we allow surface marks
on the walls and a class “Other Object” to cope with
things that the stereo vision system sees but modeling
cannot explain, and that the motion planner should not push
over.


Figure  2.  Hierarchy  of  Model 




Figure 2 shows the hierarchy that might describe a
hallway in a building. At the lowest level are the points
and lines found by the sensors. This information is fit to
a model of generic objects such as walls, doors, and the
floor. Two parallel walls that bound an elongated region
of free space could be a hall, and hallways are found in
buildings. So, when searching for a route between rooms
in a building, we can first search at the building level and
then find a path along successive levels of the map. A
graphic of an instantiation of this model, containing all
possible objects and their joining edges, is drawn in fig. 3.
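The rule that two parallel walls bounding an elongated region could be a hall can be sketched as a simple geometric test. This is a hypothetical illustration: walls are taken to be line segments in the floor plane, and the function name and both thresholds (`angle_tol_deg`, `min_aspect`) are assumptions, not values from the system described here.

```python
import math


def unit_direction(segment):
    """Unit vector along a wall segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = segment
    length = math.hypot(x2 - x1, y2 - y1)
    return (x2 - x1) / length, (y2 - y1) / length


def could_be_hall(wall_a, wall_b, angle_tol_deg=5.0, min_aspect=3.0):
    """True if the walls are near parallel and enclose an elongated strip."""
    ux, uy = unit_direction(wall_a)
    vx, vy = unit_direction(wall_b)
    # The cross product magnitude is the sine of the angle between the walls.
    if abs(ux * vy - uy * vx) > math.sin(math.radians(angle_tol_deg)):
        return False
    ox, oy = wall_a[0]

    def along(p):   # coordinate of p measured along wall_a
        return (p[0] - ox) * ux + (p[1] - oy) * uy

    def across(p):  # signed distance of p from the line through wall_a
        return -(p[0] - ox) * uy + (p[1] - oy) * ux

    a_lo, a_hi = sorted((along(wall_a[0]), along(wall_a[1])))
    b_lo, b_hi = sorted((along(wall_b[0]), along(wall_b[1])))
    overlap = min(a_hi, b_hi) - max(a_lo, b_lo)  # shared length of the walls
    width = abs(across(wall_b[0]) + across(wall_b[1])) / 2.0
    return width > 0.0 and overlap / width >= min_aspect
```

A hypothesis accepted by such a test would then be attached to the next level of the hierarchy, for example as a hallway node whose children are the two walls.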


Figure 3. Instantiation of Model

This choice of model has implications for the vision
system: all edges of structural elements in our model are
either vertical or horizontal. Note the edges in the stereo
pair of images in figure 4. All vertical edges cross a
horizontal plane at camera height (otherwise Mobi could not
fit through doors). Vertical edges alone allow the
instantiation of the model with one ambiguity: we may confuse
a gap in the wall with the wall, unless some other object
is seen through the gap. Looking at the floor edge would
possibly resolve this situation.

Whether one should look at the floor at all is also
a question of good usage of processing resources: when
all information about the model can be gained by looking
horizontally, looking down will slow the process
(considerably, as it turned out). On the other hand, looking down
occasionally is needed to avoid obstacles on the floor. Our
robot does not do so now and consequently rams into
chairs, flower pots (if small) and couches.


4.  The  Vision  System 

Image  Preprocessing  and  Edge  Detection 

Images from both cameras are digitized with a
resolution of 256 picture elements per line. A one
dimensional version of the edge appearance model is applied to
an epipolar line produced by averaging a few lines at the
horizon, causing vertical smearing of the images. (The
horizon is known from the calibration program.)

This  vertical  smearing  has  the  following  effects: 

1.  Vertical  edges  retain  their  acuity. 

2.  Slanted  edges  get  blurred,  horizontal  edges  vanish. 

3.  Tilt  and  roll  angle  misalignment  have  less  effect. 

4.  Image  noise  is  reduced. 

5.  Blobs  become  vertical  edges. 

The  last  effect  can  be  applied  to  a  different  scenario: 
A  busy  road  at  night  with  the  glare  of  headlights.  After 
passing  the  vertical  averaging  filter,  these  blobs  of  light 
would  become  stripes  that  cross  the  horizon  and  give  rise 
to  edges  suitable  for  our  algorithm. 

Vertical edges of walls and doors in our model
remain vertical and sharp after this operation and will be
recognized with high localization precision by the edge
appearance operator.
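The effects of vertical smearing listed above can be reproduced with a small sketch that averages a band of rows around the horizon into a single profile. The function name, the band height and the list-of-rows image layout are assumptions made for illustration only.

```python
def horizon_profile(image, horizon_row, half_height):
    """Average the rows around the horizon into one 1-D intensity profile.

    image is a list of rows, each row a list of grey levels.
    """
    first = max(0, horizon_row - half_height)
    rows = image[first:horizon_row + half_height + 1]
    width = len(rows[0])
    return [sum(row[c] for row in rows) / len(rows) for c in range(width)]


# A vertical edge survives the averaging unchanged.
vertical = [[0, 0, 0, 10, 10, 10] for _ in range(5)]

# A slanted edge (step moves one column per row) is blurred into a ramp.
slanted = [[10 if c > r else 0 for c in range(6)] for r in range(5)]

# A horizontal edge (dark rows above, bright rows below) vanishes.
horizontal = [[0] * 6, [0] * 6, [10] * 6, [10] * 6, [10] * 6]
```

Averaging several rows also reduces uncorrelated pixel noise, in line with the fourth effect in the list.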

The  edge  appearance  model  [Triendl  1978]  compares 
a  local  patch  of  digital  image,  e.g.  a  5  by  5  pixel  area, 
to  the  image  that  would  have  been  created  if  the  camera 
were  looking  at  an  ideal  step  edge.  It  uses  the  spatial  filter 
created  by  the  lens-camera-digitizer-preprocessor  pipeline 
for  this  purpose.  The  operator  returns  quality,  position, 
direction,  left  and  right  grey  levels  of  the  edge  element.  It 
has  infinite  spatial  resolution  (1/8  pixel  in  practice)  and 
degrades  gracefully  for  non-ideal  edges. 


Figure 4. Stereo pair of images seen by Mobi




For  locations  without  ideal  edges  but  high  gradient 
the  same  algorithm  is  applied  at  half  resolution,  because 
some  structural  edges  may  not  be  sharp  enough  (rounded 
corners)  or  ill  defined  in  detail  (reflections  or  fine  struc¬ 
tures  along  a  corner).  These  edges,  multiple  edges  that 
cannot  be  combined  and  locations  with  a  high  gradient 
that do not correspond to ideal edges, have a larger estimated
localization error.

The  algorithm  is  trimmed  for  speed  and  takes  less 
than  one  second  on  a  VAX. 

Alternative  Edge  Extractions 

Before  arriving  at  the  above  solution  we  explored  sev¬ 
eral  alternatives.  Some  of  them  are  still  available  to  be 
applied  later  for  more  specific  purposes  such  as  looking 
for a floor edge.

The  first  experiment  was  to  apply  the  edge  detector  to 
the  whole  image  and  group  and  link  edges  afterwards  into 
lines.  Even  after  transferring  the  first  stages  of  the  edge 
detector,  several  5  by  5  convolution  filters,  to  a  pipeline 
processor,  it  took  several  minutes  to  process  a  stereo  im¬ 
age. 

The  next  alternative  was  to  combine  edge  detection 
and  linking.  The  next  edgel  is  expected  at  a  position  pre¬ 
dicted  by  the  previous  one.  Processing  time  was  reduced 
to  15  seconds  per  image.  Most  of  the  processing  went  into 
the search for new edge links. Reducing this search risks
losing short and marginal chains of edges.

The resulting lines were labeled ‘straight’ and ‘bent’
and connected with other lines forming corners, T-junctions,
Y-junctions and arrows. These were intended
to label the resulting line-graph according to the model
and combine monocular and binocular stereo. Again processing
time was too long, and in addition lines were easily
broken by door knobs, labels on the wall and the like, so
that stereo pairs were difficult to determine. Some objects,
e.g. the previously mentioned flower pot, did not
give sufficiently good and consistent edges for a stereo
match.

Non-vertical lines that fit the model are floor-wall and
floor-door edges. We found these edges to be often
outside the field of view of the camera, too low in contrast,
ill defined, or obstructed by furniture. When extracting
vertical lines the image can be enhanced by application
of a vertical low pass filter, i.e. a moving average over a
few lines. The line follower won’t get thrown off by a tack
in a door frame. All important information, except edge
length, is provided by the first edge of a vertical. So finally
we drop angular sensitivity of the edge detector and line
following and arrive at our present solution.


Figure 5. Stereo pair with occlusions

Stereo Algorithm

We  can  never  be  certain  that  a  stereo  match  is  cor¬ 
rect,  since  for  any  possible  stereo  match  of  edges,  we  can 
find  a  possible  if  far  fetched  physical  explanation.  Even 
in  the  series  of  pictures  from  our  test  run  discussed  be¬ 
low,  we  find  matching  edges  that  appear  very  different, 
for  example  in  figure  5.  What  we  can  do  is  tap  all  avail¬ 
able  sources  of  information  and  cues  to  make  the  chance 
of  correct  matches  in  the  real  world  as  high  as  possible. 
See  [Baker  1982]  and  [Tsuji  1986]  for  other  solutions. 

The  stereo  mechanism  along  an  epipolar  line  that  we 
finally  settled  on  uses  edges,  grey  levels,  correlation  of 
intensities,  constellations  of  edges  and  constraint  propa¬ 
gation.  It  tries  to  preserve  left  to  right  ordering  of  edge 
matches but allows violations, e.g. by a pillar in the middle
of a room. Multiple match choices are carried along
if needed.

In the first stage all possible stereo edge matches are
weighted by

•  similarity  of  edge  step  size  (positive  or  negative 
weight) 

•  grey  level  similarity  on  left  and  right  side  of  the  edge 
(only  positive  weight).  This  handles  those  occluding 
edges  which  have  similar  grey  levels  on  one  side  only. 

•  similarity  in  grey  level  correlation  near  the  edges. 
This  measure  is  independent  of  the  widely  varying 
camera  gains  and  offsets  and  responds  to  differences 
in  edge  cross  sections. 




Matches with a positive weight are considered candidates
at this stage. Note that we did not lose matches of
only one side of the edge and that we keep matches with
similar edge cross sections even if their grey levels are different.
A sample of the resulting situation [figure 6] shows
edges and stereo matches considered at this point superimposed
on the grey level curve of left and right epipolar
line of a sample scene.

The  grey  level  comparison  function  that  we  use  is 
similar  to  the  normalized  cross  correlation  but  takes  dif¬ 
ferences  in  standard  deviation  and  mean  grey  levels  into 
account.  Since  intervals  along  the  epipolar  line  may  not 
have  the  same  length  and  the  interval  boundaries  may 
be  edges  at  subpixel  accuracy,  grey  levels  are  resampled 
by  linear  interpolation  at  approximately  pixel  steps.  The 
ratio  of  interval  lengths  also  contributes  to  the  compar¬ 
ison  function  since  it  is  unlikely  to  see  the  same  surface 
from  very  differing  angles  and  similar  interval  lengths  are 
therefore  likely. 
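A rough sketch of such a comparison function is given below. It is an illustration under stated assumptions, not the function used on Mobi: it resamples both intervals by linear interpolation at roughly pixel steps, computes a plain normalized cross correlation, and scales the result by the ratio of interval lengths. The penalty for differing mean and standard deviation mentioned above is left out, and all names (`sample`, `resample`, `ncc`, `compare_intervals`) are hypothetical.

```python
import math


def sample(line, x):
    """Linearly interpolate the grey level of `line` at real-valued position x."""
    i = min(int(math.floor(x)), len(line) - 2)
    t = x - i
    return (1.0 - t) * line[i] + t * line[i + 1]


def resample(line, start, end, n):
    """Take n equally spaced samples between subpixel positions start and end."""
    step = (end - start) / (n - 1)
    return [sample(line, start + k * step) for k in range(n)]


def ncc(a, b):
    """Normalized cross correlation of two equally long sample lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # flat interval: correlation undefined, treated as no evidence
    return sum(p * q for p, q in zip(da, db)) / denom


def compare_intervals(left, l0, l1, right, r0, r1):
    """Assumed score: NCC of the resampled intervals times the length ratio."""
    n = max(2, int(round(max(l1 - l0, r1 - r0))) + 1)  # roughly pixel steps
    a = resample(left, l0, l1, n)
    b = resample(right, r0, r1, n)
    length_ratio = min(l1 - l0, r1 - r0) / max(l1 - l0, r1 - r0)
    return ncc(a, b) * length_ratio
```

Because the correlation is normalized, a change of camera gain or offset between the two lines does not lower the score; only the shape of the grey level curve and the interval lengths matter in this simplified version.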


Matching  to  the  Model:  Walls  and  Doors 

Every  pair  of  matched  points  derived  from  neighbor¬ 
ing  edges  is  potentially  part  of  a  wall.  Angles  of  the  lines 
connecting  these  points  are  clustered  with  weights  equal  to 
their  distances.  The  floor  is  then  searched  for  linear  con¬ 
stellations  of  points  along  angles  corresponding  to  cluster 
maxima.  Straight  lines  are  fitted  to  these  constellations 
and  the  best  fit  is  considered  to  signify  a  wall  surface. 
Other  fits  with  the  same  angle  or  perpendicular  to  that 
angle  are  also  considered  to  be  a  wall  surface.  Walls  are 
further labeled ‘leftwall’ and ‘rightwall’ if they pass by the
left  or  right  side  of  the  vehicle  respectively. 
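The angle clustering step might look roughly like the following sketch. The histogram bin width and the function names are assumptions; the text only states that angles are clustered with weights equal to the distances between the points. Angles are folded into [0, 180) degrees because a wall has no direction.

```python
import math


def segment_angle(p, q):
    """Direction of the line through p and q, folded into [0, 180) degrees."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0


def dominant_wall_angle(points, bin_width=5.0):
    """Histogram the angles of neighbor-to-neighbor segments, weighted by length."""
    bins = [0.0] * int(round(180.0 / bin_width))
    for p, q in zip(points, points[1:]):
        length = math.hypot(q[0] - p[0], q[1] - p[1])
        index = int(segment_angle(p, q) / bin_width) % len(bins)
        bins[index] += length
    best = max(range(len(bins)), key=bins.__getitem__)
    return (best + 0.5) * bin_width  # center of the winning bin, in degrees


# Assumed floor points (metres): three collinear segments plus one outlier.
floor_points = [(0.0, 0.0), (5.0, 1.0), (10.0, 2.0), (15.0, 3.0), (16.0, 8.0)]
wall_angle = dominant_wall_angle(floor_points)
```

A straight line would then be fitted to the points lying along the winning angle; that fitting step is not shown here.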

Any  pair  of  points  60cm  to  140cm  apart  is  a  candidate 
for  a  door  frame.  If  the  corresponding  edges  have  big 
brightness  differences  with  the  dark  side  and  the  inner 
side  of  the  frame  a  door  is  assumed.  It  is  confirmed  if  it  is 
on  a  wall.  If  the  space  between  those  edges  does  not  have 
a  strong  grey  level  correlation  the  door  is  not  closed. 
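The first door test, the distance criterion, is simple enough to sketch directly. The brightness and on-wall checks described above are omitted, and the function name is hypothetical.

```python
import math


def door_candidates(points, min_width=0.60, max_width=1.40):
    """Return index pairs of floor points whose spacing fits a door frame (metres)."""
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            width = math.hypot(points[j][0] - points[i][0],
                               points[j][1] - points[i][1])
            if min_width <= width <= max_width:
                pairs.append((i, j))
    return pairs
```

For example, `door_candidates([(0, 0), (0.9, 0), (3.0, 0), (3.2, 0)])` keeps only the first two points as a candidate frame.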


Figure 6. Stereo match proposals and grey level curves


Next  we  deal  with  local  consistency.  First  the  grey- 
level-comparison-function  is  applied  to  the  intervals  be¬ 
tween  pairs  of  matching  edges  and  their  respective  first 
and  second  neighbors  to  the  left.  By  comparing  these  4 
combinations  per  match,  marginal  edges  present  in  only 
one  image  can  be  pinpointed  and  eliminated.  As  a  side 
effect,  high  correlation  makes  it  likely  that  the  ribbon  be¬ 
tween  edges  results  from  a  solid  object. 

Local consistency links neighbors and can tell us
which  neighboring  match  is  more  consistent.  These  neigh¬ 
bor  links  form  paths  of  maximum  consistency  that  link 
groups  of  likely  match  choices.  Simply  summing  up  match 
qualities  in  a  group  represents  an  effective  means  of  con¬ 
straint  propagation.  A  group  of  equal  bars  or  a  checker¬ 
board  that  is  entirely  visible  will  be  matched  correctly  this 
way.  In  the  case  of  an  occluding  edge  the  constraint  will 
propagate  up  to  the  edge,  and  the  part  visible  only  to  one 
camera  will  tend  to  be  left  unmatched. 

Finally, if a previous image is available, a motion
match that occurs within the limits of motion prediction
may resolve any remaining ambiguity.


Figure 7 shows the state of the model as derived
from a single view of the hallway of our lab. An additional
classification has been introduced here: ‘hallway’ for a
room that is very narrow and long.


5.  Free  Space  and  Motion  Planning 

Free  space  is  the  volume,  or  in  our  case,  the  floor 
area,  that  is  free  from  obstacles.  Initially  it  is  the  area 
occupied  by  the  vehicle,  a  roughly  circular  polygon  65cm 
in  diameter.  The  arrangement  of  robot,  cameras  and  fields 
of  view  is  drawn  in  figure  8. 




Figure 8. Geometry of robot, cameras and field of view.

When  the  stereo  system  finds  a  match  the  rays  be¬ 
tween  both  cameras  and  the  object  go  through  free  space 
if the match was correct and the physical light rays did
not go through glass panes and were not reflected by a mirror.
The  stereo  system  delivers  matches  in  a  polar  sort  about 
the  robot.  We  now  make  the  worst  case  assumption  that 
the  connection  between  neighboring  matches  represents  a 
solid object. This includes real walls with both ends visible,
openings in walls without objects seen through them
and  lines  connecting  otherwise  unrelated  matches. 

The  hull  of  lines  connecting  neighboring  matches  and 
lines  connecting  matches  with  each  camera  represents  our 
visual free space [figure 9]. Total free space is the ‘OR’ing
of  the  footprint  of  the  vehicle  with  free  space  derived  from 
vision  and  other  sources. 
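A simplified sketch of the visual free space construction follows. It assumes a single viewpoint instead of two cameras and a field of view that does not straddle the ±180 degree direction; both are simplifications introduced here, as are the function names. The polygon is bounded by the rays to the outermost matches and by the segments joining angular neighbors, which encodes the worst case assumption that neighboring matches are connected by a solid object.

```python
import math


def visual_free_space(viewpoint, matches):
    """Polygon: viewpoint followed by the matches in polar order about it."""
    vx, vy = viewpoint
    ordered = sorted(matches, key=lambda m: math.atan2(m[1] - vy, m[0] - vx))
    return [viewpoint] + ordered


def contains(polygon, point):
    """Even-odd ray casting point-in-polygon test."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Assumed stereo matches (metres) seen from the origin, looking along +x.
free_space = visual_free_space((0.0, 0.0), [(2.0, 1.0), (3.0, 0.0), (2.0, -1.0)])
```

Total free space would then be the union of such polygons with the vehicle footprint.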


Figure  9.  Free  space  and  stereo  matches 

A  safe  move  will  keep  the  vehicle  entirely  in  free 
space. From figure 9 we notice that no safe move is possible
from a single view since there is a bottleneck between
footprint  and  visual  free  space. 

If the robot has already moved, the new free space
consists of the superposition of the new visual free space,
the previous visual free space and the space swept out
during the motion. Old free space has to be shrunk according
to motion uncertainty.


To  allow  Mobi  to  move  we  tell  it  initially  that  it  has 
enough  space  to  rotate  in  place  and  position  it  accordingly 
at  least  5  cm  from  the  next  obstacle.  The  motion  plan¬ 
ner’s  top  goal  is  to  generate  enough  free  space  to  allow 
motion. In particular the vehicle will rotate until it has
seen enough to be able to find a safe move forward. The motion
planner works as follows:

1.  Intersect  free  space  with  a  circle  with  the  diameter 
equal  to  twice  the  maximum  step  size  whose  center 
is  the  robot. 

2. For each intersecting arc find the longest legal move
in the direction of the center of the arc.

3. The robot's orientation is determined by repeating
steps 1 and 2 from the new position. At the first
position the robot will face the second one.

4. If there are several moves, or several orientations,
make a random choice.

5. If no move is found, rotate.
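The planner steps above can be sketched as follows. This is a hedged illustration, not the planner that ran on Mobi: free space is abstracted as a predicate `is_free(point)`, the circle and the footprint are sampled at discrete headings, the move length is searched in fixed increments, and step 3 (choosing the final orientation) is left out. All names and sampling parameters are assumptions.

```python
import math
import random


def free_arcs(is_free, center, radius, samples=72):
    """Step 1: sample the circle around the robot, group free headings into arcs."""
    headings = [2.0 * math.pi * k / samples for k in range(samples)]
    flags = [is_free((center[0] + radius * math.cos(h),
                      center[1] + radius * math.sin(h))) for h in headings]
    arcs, current = [], []
    for h, free in zip(headings, flags):
        if free:
            current.append(h)
        elif current:
            arcs.append(current)
            current = []
    if current:
        arcs.append(current)
    if len(arcs) > 1 and flags[0] and flags[-1]:
        # The first and last arcs meet at heading 0: merge them into one arc.
        arcs[0] = [h - 2.0 * math.pi for h in arcs.pop()] + arcs[0]
    return arcs


def footprint_is_free(is_free, center, footprint_radius, samples=16):
    """True if the center and sampled rim points of a circular footprint are free."""
    rim = [(center[0] + footprint_radius * math.cos(2.0 * math.pi * k / samples),
            center[1] + footprint_radius * math.sin(2.0 * math.pi * k / samples))
           for k in range(samples)]
    return all(is_free(p) for p in [center] + rim)


def longest_legal_move(is_free, center, heading, max_step, footprint_radius,
                       increment=0.05):
    """Step 2: advance along `heading` while the footprint stays in free space."""
    distance = 0.0
    while distance + increment <= max_step + 1e-9:
        d = distance + increment
        p = (center[0] + d * math.cos(heading), center[1] + d * math.sin(heading))
        if not footprint_is_free(is_free, p, footprint_radius):
            break
        distance = d
    return distance


def plan_move(is_free, center, max_step, footprint_radius, rng=random):
    """Steps 1, 2, 4 and 5: return ('move', heading, distance) or ('rotate',)."""
    moves = []
    for arc in free_arcs(is_free, center, max_step):
        heading = arc[len(arc) // 2]  # center of the arc
        distance = longest_legal_move(is_free, center, heading,
                                      max_step, footprint_radius)
        if distance > 0.0:
            moves.append(('move', heading, distance))
    return rng.choice(moves) if moves else ('rotate',)
```

With a corridor-shaped `is_free` the sketch proposes a move along the corridor axis; when the circle of radius `max_step` lies entirely outside free space it falls back to rotation, matching step 5.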


Uncertainty  Reduction 

As the robot moves, its location is determined by
odometry.  However,  due  to  wheel  slippage,  finite  shaft 
encoder  resolution,  backlash,  as  well  as  other  problems, 
there  is  some  uncertainty  in  measuring  wheel  positions. 
This  is  especially  significant  with  odometry  because  error 
is  integrated  with  successive  motion.  Furthermore,  there 
is  some  uncertainty  in  measurement  of  the  location  of  cor¬ 
respondence  points  from  stereo  vision.  Thus,  when  the 
robot  views  a  point  in  space  (e.g.  a  door  frame)  from  two 
locations, the point will appear to be at two distinct locations
unless one considers uncertainty [Brooks, Chatila].
For  some  measurements,  the  easiest  solution  may  be  to 
ignore  it  by  just  checking  fixed  intervals  about  any  point 
(e.g.  two  points  are  the  same  if  they’re  within  2  inches). 
A more sophisticated approach than representing points
by their coordinates is to represent them by the probability
that the point is at a certain location (i.e. a probability
density function). In this work, all points are represented
by a multivariate normal distribution with a mean vector
$\mu$ and covariance matrix $\Sigma$. Higher order moments or
other distributions may provide more information but are
much more complex and would require great computation
time.

All stereo correspondences are made at a location $P_i$
which can be related to the robot’s previous location $P_{i-1}$
by a transform $T_{i-1,i}$. Since Mobi only travels inside
buildings, and the floor is basically level, the transforms
only need to represent the Cartesian location and the angle
of rotation, $(x, y, \theta)$. From odometry, we can determine
the mean and covariance of the motion which we
have  taken  to  be  a  function  of  distance  traveled.  The 
stereo  measurements  are  also  taken  to  have  uncertainty 
due  to  localization  of  edges  in  the  two  images.  It  should 
be  noted  that  the  standard  deviation  of  the  error  in  dis¬ 
tance  to  a  point  grows  as  the  square  of  the  distance  to  the 
point whereas localization uncertainty in the directions
parallel  to  the  cameras’  focal  plane  grows  proportional  to 
distance. 
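The two growth laws follow from first-order error propagation through the triangulation equations of a parallel-axis stereo pair. The sketch below assumes that idealized geometry, with focal length `focal_px` in pixels and baseline `baseline` in metres, and treats `sigma_px` as the standard deviation of both the disparity and the image column. These names and simplifications are assumptions, not the calibration model of the paper.

```python
def stereo_point(x_left, x_right, focal_px, baseline):
    """Triangulate lateral offset X and depth Z from image columns (pixels)."""
    disparity = x_left - x_right
    depth = focal_px * baseline / disparity
    lateral = depth * x_left / focal_px
    return lateral, depth


def stereo_sigmas(depth, focal_px, baseline, sigma_px):
    """First-order standard deviations for a localization error of sigma_px."""
    sigma_depth = depth ** 2 * sigma_px / (focal_px * baseline)  # grows with Z^2
    sigma_lateral = depth * sigma_px / focal_px                  # grows with Z
    return sigma_lateral, sigma_depth
```

Doubling the depth in `stereo_sigmas` quadruples the depth error and doubles the lateral error, which is why the uncertainty ellipses of distant points are strongly elongated along the viewing direction.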




Figure 10A shows the uncertainty ellipses of two
points ($C_a$ and $C_b$) that were viewed from two different locations
($P_1$ and $P_2$). This figure shows that there was an
error in the motion transform $T_{12}$ as determined by odometry,
because correspondence points $C_{a1}$ and $C_{a2}$ should be
at the same location, as should points $C_{b1}$ and $C_{b2}$. However,
consider the uncertainty in the location of the robot
at $P_2$, as represented by the ellipse about its location in
fig. 10B. The uncertainty in the location of the stereo
points $C_{a2}$ and $C_{b2}$ with respect to location $P_1$ is found
by compounding [Smith], also shown in fig. 10B. Now we
can determine that points $C_{a1}$ and $C_{a2}$ are indeed
measurements of the same point, and so we can now apply
a Kalman filter to reduce the uncertainty in both the
stereo correspondences and the motion transform. The
transform and the correspondence points as seen from $P_1$
become our initial state vector along with their covariances.
We assume that the stereo measurements as well
as the transform are independent. The correspondence
points as seen from $P_2$ become the measurement vector,
and we just apply the Kalman filter [Junkins] to obtain
figure 10C. Note that the uncertainty of the motion has
been reduced, as has that of the location of the correspondence
points as viewed from both positions. With this, we can
integrate motion sequences to a greater level of accuracy.
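The two operations named above, compounding and the Kalman update, can be sketched for a single point. This is a reduced illustration under assumptions: the state here is only one 2-D point, whereas the filter described in the text also carries the motion transform in its state vector; the measurement is taken to observe the point directly; and all function names are hypothetical. Matrices are plain nested lists so the sketch needs no external library.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inverse2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def compound(transform, cov_t, point, cov_p):
    """Express a point seen from P2 in the frame of P1, with first-order covariance.
    transform = (x, y, theta) of P2 in P1; cov_t is 3x3, cov_p is 2x2."""
    x, y, theta = transform
    c, s = math.cos(theta), math.sin(theta)
    px, py = point
    mean = (x + c * px - s * py, y + s * px + c * py)
    j_t = [[1.0, 0.0, -s * px - c * py],   # Jacobian w.r.t. (x, y, theta)
           [0.0, 1.0, c * px - s * py]]
    rot = [[c, -s], [s, c]]                # Jacobian w.r.t. the point
    cov = add(matmul(matmul(j_t, cov_t), transpose(j_t)),
              matmul(matmul(rot, cov_p), transpose(rot)))
    return mean, cov


def fuse(mean_a, cov_a, mean_b, cov_b):
    """Kalman update: prior (mean_a, cov_a), direct measurement (mean_b, cov_b)."""
    gain = matmul(cov_a, inverse2(add(cov_a, cov_b)))
    innovation = [[mean_b[0] - mean_a[0]], [mean_b[1] - mean_a[1]]]
    correction = matmul(gain, innovation)
    mean = (mean_a[0] + correction[0][0], mean_a[1] + correction[1][0])
    identity_minus_gain = [[1.0 - gain[0][0], -gain[0][1]],
                           [-gain[1][0], 1.0 - gain[1][1]]]
    return mean, matmul(identity_minus_gain, cov_a)
```

Compounding inflates the covariance of a point seen from the second position by the uncertainty of the motion; fusing two estimates of the same point then yields a covariance smaller than either input, which is the reduction shown in figure 10C.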


Figure  10.  Uncertainty  reduction. 

A:  Two  points  seen  from  different  locations 
B:  Uncertainty  with  respect  to  Pi 
C:  Uncertainties  combined 


Navigation with positions derived from correspondence
points proved to be much more precise than odometry.


Figure 11. Example of visual navigation: The robot
icons show the derived position, while the arrows indicate
predicted motion relative to the previous position. The
arrow tails indicate the commanded position of the robot
while the direction to the arrow head indicates the commanded
orientation. The differences between the arrow
tails and the icon centers are very large, and the precision
can be judged from the fact that the wall segments as seen
from various positions superimpose.




6.  The  Great  Excursion 


Now,  let  us  take  a  look  at  what  a  real  mobile  robot 
run  is  like.  Mobi  is  turned  on  in  front  of  the  calibra¬ 
tion  chart  discussed  previously,  and  the  camera  param¬ 
eters  are  determined.  Mobi  is  now  moved  out  into  the 
hallway  (right  now,  we  don’t  have  the  precision  to  send  it 
through doorways). To make the start of this run interesting,
we pointed Mobi at a blank wall and commanded
it to go; this was our last action, and from then on Mobi was on its
own.

Sure  enough,  it  did  not  see  any  edges  while  staring 
at  a  white  wall,  and  so  the  robot  performed  the  only 
safe  move,  rotation.  After  this  move,  mobi  still  hasn’t 
seen  enough  to  move  forward  and  rotates  further.  It  is 
confronted  with  its  first  substantial  view,  the  hallway  as 
shown  in  figure  4. 

From  here  it  sees  enough  correspondence  points  to 
build  a  large  enough  region  of  free  space  to  begin  trans¬ 
lating.  Mobi  moves  forward  a  meter  though  not  yet  to¬ 
wards  the  center  of  the  hall  because  it  needs  to  see  more 
points on the right wall. After a few wavering moves, Mobi
travels towards the end of the hallway without any further incidents.

The  first  great  challenge  looms  off  in  the  distance,  the 
great  white  pillar  (shown  in  figure  5).  This  poses  a  partic¬ 
ularly  interesting  problem  because  the  epipolar  ordering 
constraint  may  be  violated.  Furthermore,  the  building 
corner  causes  a  major  occlusion  of  the  window.  Note  the 
white  signs  on  the  window  are  visible  in  the  right  image 
and  are  occluded  in  the  left  image.  The  occluding  edge 
changes  from  light  gray  to  black  in  the  left  image  while 
switching  from  gray  to  bright  white  in  the  right  image. 
Also note the black wire dangling close to the corner. This
is the real world after all. The robot still sees a large region
of free space and heads towards the water fountain. Finally,
Mobi gets close to the end of the hallway, sees little
and decides to round the corner as shown in the motion
sequence of figure 12 on this page.

In  the  first  stereo  pair,  the  pillar  is  seen  occluding  the 
paper  signs  in  the  right  image  but  not  in  the  left,  caus¬ 
ing  edge  inversion  again.  Still  it  is  recognized,  and  Mobi 
passes the pillar. Seeing the reflections of the lobby, which
we’ll  encounter  in  the  next  motion  sequence,  Mobi  heads 
straight  toward  the  window.  Fortunately  for  us  (windows 
are expensive), Mobi sees the edges of the aforementioned
paper signs and the two bright dots that are coincidentally
at eye level. Mobi rotates again to avoid bumping the
window  and  one  of  the  proud  authors  is  seen  in  the  re¬ 
flection,  smiling  from  ear  to  ear.  Rotations  continue,  and 
we  are  now  facing  the  far  end  of  the  lobby,  featuring  two 
Stanford  Arms  of  old,  barm  and  yarm. 

Mobi’s  excursion  continues  across  the  lobby  heading 
to the window on the other side of the lobby. However, as it
neared  the  window,  mobi  saw  its  own  reflection  in  the 
window  and  turned  away.  Unfortunately  for  the  authors, 
this  and  a  few  preceding  images  have  been  destroyed,  but 
we  can  still  go  to  the  video  tape.  Now  we  enter  the  last 
image  sequence  where  the  window  is  filling  the  right  image 
and  a  bulletin  board  is  dominating  the  left  image.  The 
disparity  is  nearly  half  an  image  because  we  are  so  close. 
Of  course  mobi  rotates  and  faces  a  similar  scene.  This 
scene  is  cluttered  but  fits  our  model  of  a  wall  with  surface 
marks.  Rotations  continue  with  the  far  door  coming  into 
view.  Mobi  is  now  looking  at  a  large  area  of  free  space 
and  moves  forward  towards  that  door. 

Sadly, the run ended because of a communication failure
between the Lisp machine and the robot, permitting the
researchers and onlookers to retire for the evening after
celebrating over champagne.


7.  Conclusion 

So,  we  have  seen  that  Mobi  can  successfully  navigate 
through  the  inside  of  a  building  under  automatic  visual 
control.  The  algorithm  used  during  the  discussed  excur¬ 
sion  took  between  12  and  18  seconds  per  vision  step  plus 
motion time. Though our choice of algorithm was aimed
at producing a lively rate, neither the hardware nor the coding
was optimized, and too much time was spent on communication
between computers.

By  choosing  a  model  that  can  be  simply  instantiated 
with  the  detection  of  vertical  edges  in  the  world,  process¬ 
ing  time  has  been  greatly  reduced.  Furthermore,  motion 
sequences  have  been  aggregated  to  reduce  the  uncertainty 
in  navigation  and  stereo.  Finally,  a  symbolic  model  of 
building  structure  has  been  automatically  created. 

Acknowledgments

Our  thanks  go  to  AFOSR,  IU,  ALV,  FJHF  and  in  par¬ 
ticular  to  Tom  Binford,  Soon  Yao  Kong  and  Ron  Fearing 
for  their  help  throughout  this  work. 


8.  References 


2.  Triendl,  Ernst;  Kriegman,  David,  Stereo  Vision  and 
Navigation  within  a  building,  to  be  published  in  IEEE 
conference  on  Robotics  and  Automation,  1987. 

3. D.J. Kriegman; E. Triendl; Tom Binford, A Mobile
Robot: Sensing, Planning and Locomotion, to be
published in IEEE Int. Conf. Robotics & Automation,
1987.

4. A.R. de Saint Vincent, A 3D Perception System for
the Mobile Robot Hilare, Proc. IEEE Int. Conf.
Robotics & Automation, 1986.

5.  S.  Tsuji  et  al.,  Stereo  Vision  for  a  Mobile  Robot: 
World  Constraints  for  Image  Matching  and  Interpre¬ 
tation,  Proc.  IEEE  Int.  Conf.  Robotics  &  Automa¬ 
tion,  1986. 

6. A.M. Waxman et al., A Visual Navigation System,
Proc. IEEE Int. Conf. Robotics & Automation,
1986.

7. Smith, Randall; Self, Matthew; Cheeseman, Peter,
Estimating Uncertain Spatial Relationships in
Robotics, Proceedings AAAI Workshop on Uncertainty
in Artificial Intelligence, August, 1986.

8.  Brooks,  Rodney  A.,  Visual  Map  Making  for  a  Mobile 
Robot,  Proceedings  IEEE  International  Conference 
on  Robotics  and  Automation,  1985. 

9. Chatila, Raja; Laumond, Jean-Paul, Position Referencing
and Consistent World Modeling for Mobile
Robots, Proc. IEEE Int. Conf. Robotics & Automation,
1985.

10. H.P. Moravec, The Stanford Cart and the CMU
Rover, Proc. IEEE, vol. 71, no. 7, July 1983.

11. H. Baker, Depth from Edge and Intensity Based
Stereo, AIM-347, Stanford University, 1982.

12. R.A. Brooks, Symbolic Reasoning among 3-D Models
and 2-D Images, Ph.D. dissertation, Stanford University,
1981.

13. Junkins, J.L., An Introduction to Optimal Estimation
of Dynamical Systems, Sijthoff & Noordhoff International
Publishers B.V., 1978.


14. E. Triendl, Modellierung von Kanten bei unregelmäßiger
Rasterung, in Bildverarbeitung und Mustererkennung,
E. Triendl (ed), Springer, Berlin, 1978.

15. E. Triendl, How to get the Edge into the Map, Proc. 4th International
Conference on Pattern Recognition, Kyoto,
1978.




AuRA:  AN  ARCHITECTURE  FOR  VISION-BASED  ROBOT  NAVIGATION 


Ronald  C.  Arkin 
Edward  M.  Riseman 
Allan  R.  Hanson 

Computer  and  Information  Science  Department,  University  of  Massachusetts, 
Graduate  Research  Center,  Amherst,  Massachusetts,  01003 


Abstract 

The VISIONS research environment at the University of Massachusetts
provides an integrated system for the interpretation of visual
data. A mobile robot has been acquired to provide a testbed for
segmentation,  static  interpretation,  and  motion  algorithms  developed 
within  this  framework.  The  robot  is  to  be  operated  in  test  domains 
that  encompass  both  interior  scenes  and  outdoor  environments.  The 
test  vehicle  is  equipped  with  a  monocular  vision  system  and  will 
communicate with the lab via a UHF transmitter-receiver link.

Several  visual  strategies  are  being  explored.  These  include  a  fast 
line-finding algorithm for path following, a multiple frame
depth-from-motion algorithm for obstacle avoidance, and the relationship
of  schema-based  scene  interpretation  to  mobile  robotics,  especially 
regarding  vehicle  localization  and  landmark  recognition.  These  al¬ 
gorithms  have  been  tailored  to  meet  the  needs  of  mobile  robotics  by 
utilizing  top-down  knowledge  when  available  to  reduce  computational 
demands. 

To facilitate the implementation of these algorithms, the UMASS
Autonomous Robot Architecture (AuRA) has been developed. Employing
both high-level semantic knowledge and control structures
consisting  of  low-level  motor  schemas,  action-oriented  perception  and 
schema-based  navigation  are  being  investigated.  Initial  path  plan¬ 
ning  is  conducted  through  a  ground-plane  meadow-map  which  con¬ 
tains  semantic  and  visual  data  relevant  to  the  actual  navigational 
execution of the path. Information from the meadow-map along with
the goals developed by the mid-level path planner are then passed to
the pilot, which selects appropriate motor schemas for instantiation
in the motor schema manager. Pertinent vision algorithms or perceptual
schemas are associated with each instantiated motor schema
to guide the vehicle along its way.

Results of both actual robot navigation in an outdoor campus
environment and appropriate simulations are presented to show the
viability of this AI architecture. These examples concentrate on the
three areas of vision research mentioned above.


1.  Introduction 

The  VISIONS  group  at  the  University  of  Massachusetts  has 
an  extensive  and  ongoing  research  project  in  the  interpretation 
of real-world images [1,2]. More recently, efforts are being made
to  migrate  many  of  the  concepts  developed  within  the  VISIONS 
system  to  the  application  of  mobile  robot  navigation.  A  mo¬ 
bile  robot  has  been  acquired  to  provide  an  experimental  testbed 
for  this  effort.  This  vehicle  (a  Denning  Mobile  Robot  -  DRV) 
is  equipped  with  a  monocular  video  camera  and  a  ring  of  24 
ultrasonic  sensors. 

This research was supported in part by the General Dynamics Corporation
under grant DKY-fifllSSC and the U.S. Army under ETL grant
DACA76-85-C-008.


AuRA  (Autonomous  Robot  Architecture)  is  a  system  archi¬ 
tecture  that  provides  necessary  extensions  to  the  VISIONS  sys¬ 
tem  that  are  primarily  concerned  with  safe  mobile  robot  navi¬ 
gation.  These  extensions  include  the  addition  of  representations 
specific  to  navigation,  the  incorporation  of  motor  schemas  as  a 
means  of  associating  perceptual  techniques  with  motor  behav¬ 
iors,  and  the  introduction  of  homeostatic  control  utilizing  inter¬ 
nal  sensing  as  a  means  for  dynamically  altering  planning  and 
motor  behaviors. 

The  remainder  of  this  paper  is  divided  into  the  following 
sections.  Section  2  will  present  an  overview  of  the  AuRA  ar¬ 
chitecture.  Section  3  will  describe  how  navigation  is  accom¬ 
plished  within  AuRA,  specifically  the  roles  of  long-term  and 
short-term  memory  and  the  operation  of  the  navigator,  pilot 
and motor-schema manager. Section 4 describes the relationship
of VISIONS to AuRA, concentrating particularly on the
VISIONS schema system. Modular vision algorithms, including
techniques  already  embedded  as  well  as  those  currently  being 
investigated  for  potential  incorporation,  are  described  in  section 
5.  Experimental  results,  including  robot  experiments  as  well  as 
simulations,  are  presented.  A  summary  of  the  paper  and  a  brief 
description  of  future  work  complete  the  report. 

2. Architecture Overview

A block diagram of AuRA is presented in Figure 1. AuRA
consists  of  five  major  components:  the  planning,  cartographic, 
perception,  motor  and  homeostatic  subsystems.  The  planner 
consists  of  the  motor  schema  manager,  pilot,  navigator  and  mis¬ 
sion  planner  and  is  described  in  section  3.  A  cartographer,  whose 
task  is  to  maintain  the  information  stored  in  long-  and  short¬ 
term  memory  and  supply  it  on  demand  to  planning  and  sensory 
modules,  provides  the  additional  functionality  needed  for  navi¬ 
gational  purposes.  Long-term  memory  (LTM)  stores  the  a  pri¬ 
ori  knowledge  available  to  the  system,  while  short-term  memory 
(STM)  contains  the  acquired  perceptual  model  of  the  world  over¬ 
laid  on  an  LTM  context.  The  cartographer  is  also  responsible 
for maintaining the uncertainty in the vehicle’s position.

A perception subsystem (ultimately consisting of the VISIONS
system, sensor processing and sensors) is delegated the
task  of  fielding  all  sensory  information  from  the  environment, 
performing  preliminary  filtering  on  that  data  for  noise  removal 
and  feature  enhancement,  then  extracting  perceptual  events  and 
structuring  the  information  in  a  coherent  and  consistent  manner, 
and  finally  delivering  it  to  the  cartographer  and  motor  schema 
manager.  It  is  also  the  subsystem,  in  conjunction  with  the  car¬ 
tographer,  where  expectations  are  maintained  to  guide  sensory 
processing. 

The motor subsystem is the means by which the vehicle interacts
with its environment in response to sensory stimuli and
high-level plans. Motors and motor controllers serve to effect the
necessary positional changes. A vehicle interface directs the motor
controllers to perform the requested motor response received
from higher level processing.

Figure 1. System Architecture for AuRA

The  homeostatic  control  subsystem  is  concerned  with  the 
maintenance  of  a  safe  internal  environment  for  the  robot.  Inter¬ 
nal  sensors  provide  information  which  can  dynamically  affect  the 
decision-making  processes  within  the  planner,  as  well  as  modify 
specific  motor  control  parameters. 

The first pass implementation of the perceptual system does
not draw on the VISIONS system in its entirety. The VISIONS
system is ultimately expected to be the location where all sensor
fusion occurs, yielding ideally a rich 3-D model of the perceived
world. At this stage in AuRA’s development, however, relevant
vision algorithms are extracted from the VISIONS environment
and are used outside of its context. This strategy allows the
real-time needs of mobile robotics to be met while the vision
algorithms are not yet developed on parallel hardware. In so doing,
the cartographer assumes greater responsibility than might
be needed in future designs when many of the cartographer’s
responsibilities are subsumed by the VISIONS system. Figure
2 shows AuRA’s initial implementation strategy. Homeostatic
control is to be implemented only after the motor schema manager
is moved from simulation to real-time implementation and
the vehicle is equipped with the necessary internal sensors. The
mission planner is currently rudimentary and has a low priority
for development.

The subsections that follow describe briefly the ultimate roles
of the various AuRA subsystems (with the exception of the planning
subsystem, which is discussed in section 3).

2.1  Cartographer 

The cartographer is the manager of the non-VISIONS representations
and high-level controller of the map maintenance
processes. Its responsibilities include:


•  Preservation of the integrity of the perceived world model,
reconciling temporally conflicting sensor data.

•  Initiating and scheduling processes whose duty it is to:

   - incorporate data from the perception subsystem into
     short-term memory (STM)

   - instantiate models from LTM into STM

   - provide sensor expectations and to guide schema
     instantiation

•  Maintenance of uncertainty at all levels of representation

   - spatial error map maintenance for robot localization

   - STM environmental uncertainty handling (object
     location)

•  Initial LTM map building (i.e. knowledge acquisition)

2.2  Perception  Subsystem 

Environmental sensor processing occurs within the confines
of the perception subsystem. This component of AuRA consists
of three submodule types: sensors, sensor processors, and the
VISIONS system. In the early stages of the robot system development,
we utilize a small subset of available vision algorithms,
including simplified versions of some low-level algorithms that
are tuned for real-time performance, until a full real-time scene
interpretation VISIONS environment becomes available.

Sensor processors serve to preprocess the sensor data into a
form that is acceptable to the receiving modules. The principal
goal for these sensor-specific filters (e.g. from vision, ultrasonic or
dead-reckoning sensors) is to simplify the job facing the VISIONS
system by removing noisy, extraneous, or erroneous data and by
converting the relevant data from diverse sensors into a form
that can be integrated into world representations.




Figure  2.  First  Pass  Implementation  of  AuRA  Architecture 


The  VISIONS  system  is  the  heart  of  the  perception  sub¬ 
system.  Multiple  levels  of  sensor  data  and  their  associated  in¬ 
terpretations  are  present.  Perceptual  schemas  are  instantiated 
and  maintained  within  this  system.  The  net  result  is  a  collec¬ 
tion  of  plausible  hypotheses  and  interpretations  for  sensor  data 
with  associated  confidence  levels  that  reflect  their  uncertainty. 
Data  can  be  drawn  off  by  the  planner  at  any  representation  level 
within  the  VISIONS  system,  ranging  from  low-level  pixel  data 
and  intermediate-level  lines  and  surfaces,  to  high-level  full  scene 
interpretations. 

Information  foretelling  imminent  danger  will  pass  directly  to 
the  pilot  or  vehicle  interface  from  the  sensor  processors  via  panic 
shunts  without  the  mediation  of  the  cartographer,  VISIONS  sys¬ 
tem  or  motor  schema  manager.  These  panic  shunts  are  intended 
to  emulate  reflex  arc  activity,  bypassing  higher  level  processing. 
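The panic shunt can be pictured as a routing decision made inside each sensor processor. The sketch below is only an illustration under stated assumptions: the `Reading` type, the `PANIC_RANGE_M` threshold and the module names returned by `route` are hypothetical and do not come from the AuRA implementation.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    sensor: str      # e.g. "ultrasonic" (illustrative label)
    range_m: float   # distance to the nearest return, in metres


# Assumed danger threshold; the paper does not give a value.
PANIC_RANGE_M = 0.3


def route(reading: Reading) -> str:
    """Return the name of the module that should receive this reading."""
    if reading.range_m < PANIC_RANGE_M:
        # Reflex arc: bypass the cartographer, VISIONS and the
        # motor schema manager and go straight to the vehicle interface.
        return "vehicle_interface"
    # Normal case: the reading is handed on for world modelling.
    return "cartographer"
```

A reading of 0.1 m would be shunted to the vehicle interface, while a reading of 2.0 m would follow the normal path through the cartographer.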

2.3  Motor  Subsystem 

The  motor  subsystem  is  delegated  the  responsibility  of  ef¬ 
fecting  the  commands  of  the  motor  schema  manager.  The  motor 
subsystem  consists  of  three  major  components:  motors,  motor 
controllers,  and  vehicle  interface.  The  steering  motors,  drive  mo¬ 
tors  and  motor  controllers,  in  the  case  of  the  UMASS  DRV,  are 
provided  by  the  manufacturer. 

The  vehicle  interface,  in  its  most  general  version,  will  use  the 
output  of  motor  schemas  as  a  means  for  effecting  action  in  the 
environment.  This  module  is  the  one  component  of  the  overall 
architecture  which  is  most  profoundly  influenced  by  the  specific 
robot  vehicle  chosen.  Vehicle  independence  is  a  design  goal  for 
all  other  AuRA  modules.  The  vehicle  interface  translates  the 
commands  from  the  motor  schema  manager  into  the  specific  form 
required  for  the  vehicle. 
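As a rough illustration of this translation step, the sketch below converts a desired velocity vector into a heading and a clamped speed. The function name `to_vehicle_command`, the heading convention (degrees counter-clockwise from the x axis) and the `max_speed` limit are assumptions made for the example; the actual command format of the UMASS DRV is not described here.

```python
import math


def to_vehicle_command(vx: float, vy: float, max_speed: float = 1.0):
    """Translate a desired velocity vector into (heading_deg, speed).

    Hypothetical vehicle-specific form: a steering heading in degrees
    and a drive speed clamped to what the vehicle can deliver.
    """
    speed = min(math.hypot(vx, vy), max_speed)
    heading_deg = math.degrees(math.atan2(vy, vx)) % 360.0
    return heading_deg, speed
```

Only this function would need to change for a different vehicle, which is in keeping with the stated goal of vehicle independence for all other modules.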


2.4  Homeostatic  Control  Subsystem 

Concern for behavioral changes in planning due to the internal
state of the robot has not been encountered elsewhere in the
literature. Most systems assume optimal conditions at all times;
others (e.g. [3]) operating in hazardous environments simply
determine whether or not it is safe to enter a particular location;
still others (e.g. [32]) make plans based on fuel reserves and other
factors, but the robot’s dynamic behavior is not considered.

In the proposed system, internal surveillance of the robot is
constantly maintained by appropriate sensors. 'Life'-threatening
conditions such as excessive temperatures, corrosive atmospheres,
or low energy levels can dynamically alter variables in
the motor subsystem and affect decision-making within the planning
subsystem. A detailed description of the issues for such a
homeostatic control system can be found in [5]. This extended
functionality will provide the robot with enhanced survivability
through a greater capability to respond to a changing environment.
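One way to picture internal state altering a motor variable is a speed limit that shrinks as reserves run low. This is a minimal sketch under assumed values: the function `speed_limit`, the thresholds `low_energy` and `max_temp_c`, and the scaling rules are all hypothetical and are not taken from [5].

```python
def speed_limit(nominal: float, energy_frac: float, temp_c: float,
                low_energy: float = 0.2, max_temp_c: float = 70.0) -> float:
    """Scale the nominal maximum speed by the robot's internal state."""
    limit = nominal
    if energy_frac < low_energy:
        # Slow down linearly as the energy reserve approaches zero.
        limit *= energy_frac / low_energy
    if temp_c > max_temp_c:
        # Halve the limit while the robot is running hot.
        limit *= 0.5
    return limit
```

With full reserves and a normal temperature the nominal limit is returned unchanged, so a planner that ignores homeostatic input sees the optimal-conditions behavior.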

Although initial system designs will assume optimal conditions
for the homeostatic control system, its design considerations
will be dealt with from the start in order to simplify the
integration of this concept into later versions.

3.  Navigation 

Several papers document the navigational and path planning
techniques used in AuRA. The roles of the navigator and long-term
memory are described in [6] and the motor schema manager’s
function is presented in [7]. A more complete description
of the planning subsystem and navigational representations appears
in [8]. The intent of this section is to provide an overview
of the process of navigation for our system, concentrating particularly
on the relationship of visual perception to the robot’s
path choice and successful path completion.

There are two distinct levels of path planning available:
map-navigation, based on a priori knowledge available from the
cartographer and embedded in long-term memory; and
sensor-data-driven piloting conducted by the motor schema manager
upon the receipt of instructions from the pilot. The motor schema
manager is perhaps best viewed as the execution arm of the pilot,
responding to the perceived world in an intelligent manner while
striving to satisfy the navigator’s goals. First, let’s examine the
hierarchical planning component of the planning subsystem.

A hierarchical planner, consisting of a mission planner, navigator
and pilot (fig. 3), implements the requested mission from
the human commander. The functions of the three hierarchical
submodules are described below. It should be remembered that
communication is two-way across the submodule interfaces, but
is predictable and predetermined, the central characteristic of
hierarchical control.


3.1  Mission  Planner 


The mission planner is given the responsibility for high-level
planning. This includes spatial reasoning capabilities, determination
of navigation and pilot parameters and modes of operation,
and selection of optimality criteria. Input to this module
is from three sources: the cartographer, the homeostatic control
subsystem and the human commander. The cartographer provides
current world status, including both short-term and long-term
memory structures. The homeostatic control system provides
data regarding the robot’s current internal status: energy
and temperature levels and other relevant safety considerations
that have a bearing on the robot’s ability to successfully complete
any plan. No assumptions should be made by the planner
that the robot has the necessary resources available to complete
any plan that is developed. This is crucial for reliable long-range
planning capabilities.

Mission commands are entered by the human commander
through a user interface. The exact structure of these commands
will be dictated by the task domain (domestic, military, industrial,
etc.)

Figure 3. Hierarchical Planner for AuRA

Real-time operation is not as crucial for mission planning as it
is for lower levels in the planning hierarchy. Nonetheless, efficient
replanning may be necessary at this level upon receipt of status
reports from the navigator indicating failure of the attainment
of any subgoal.

The output of the mission planner is directed to the navigator.
It will consist of a list of parameters and modes of operation
that determine the overall behavior of the robot. Additionally,
a list of mission specifications and commands (subgoals) for the
current task will be provided.

The mission planner, although a significant component of the
overall architecture, has a relatively low priority for implementation
at this time.

3.2  Navigator 

The navigator accepts the specifications and behavioral parameter
lists from the mission planner and designs a point-to-point
path from start to goal based on the current a priori world
model stored in LTM. The representation used by the navigator
is the “meadow map”, a hybrid vertex-graph free-space
world model. Status reports are issued back to the mission planner
either upon successful completion of the mission specifications
(subject to the behavioral constraints) or upon failure to
meet the requisite goals. If failure results, the reason for failure
is reported as well.

The meadow map’s basic structure is an outgrowth of work
by Crowley jOi and Giralt and Chatila |f.l. Our work
is distinguished from that which preceded it by the incorporation
of extensions to include multiple terrain types, the use of
specialized map production algorithms, the availability of several
search strategies and the ability to easily embed perceptual
knowledge. Data stored at this level reflects geometrically and
topologically the robot’s modeled world. A polygonal approximation
of all obstacles is used to simplify both map building and
navigational computation. The necessary visual representations
(feature map) for path execution and uncertainty management
are tied to these polygonal ground plane projection models. This
map serves as the basis for the robot's short-term memory context.
Specific components are instantiated in STM based upon
the robot's current position and the current navigational subgoal.

The 2-D feature map can be viewed as a facet or appendage of
the associated meadow map. Data pertaining to the distinctive
features of terrain, obstacles, landmarks and other entities that
are embedded within the meadow map constitute the 2-D feature
map. The information stored here contains the attributes of the
meadow map's vertices (1-D representations), lines (2-D image
representations), and polygons and their associated obstacles or
free space (3-D models).

Depending upon the robot's current position, meadows from
long-term memory are moved into short-term memory. These
contain information on landmarks currently visible, features of
known obstacles, terrain characteristics, and the like. This data
is available for prediction by the perception subsystem or for
use by the pilot for schema instantiation. All meadows visible
from the robot’s current position and orientation as well as those
expected to be visible during the path traversal will be made
current in STM.

Output of the navigator is directed to the pilot. This output
consists of a point-to-point path and necessary parameters and
modes that will affect the pilot's overall behavior. Essentially,
the navigator is model-driven (the model being the meadow
map), passing off its goals to the data (sensor)-driven pilot. Status
information is received by the navigator from the pilot indicating
either the successful completion or failure of the established
goals of the pilot. Upon pilot failure, the navigator may
initiate replanning without reinvoking the mission planner. Time
constraints are more critical for the navigator than the mission
planner but are not as stringent as those needed for the real-time
requirements of the pilot and motor schema manager. See
[6] for a complete description of the navigator and meadow-map
representation.

3.3  Pilot 

The  pilot's  function  is  to  accept  a  point-to-point  path  speci¬ 
fied  by  the  navigator  and  provide  the  robot  with  suitable  motor 
behaviors  that  will  lead  to  its  successful  traversal.  The  pilot 
accomplishes  this  by  selecting  appropriate  motor  schemas  from 
a  repertoire  of  available  behaviors  (based  on  the  current  long¬ 
term  memory  context)  and  passing  them  (properly  parameter¬ 
ized)  to  the  motor  schema  manager  for  instantiation.  From  that 
point  on,  path  execution  is  turned  over  to  the  motor  schema 
manager.  During  actual  path  traversal,  the  cartographer  con¬ 
currently  builds  up  a  short-term  memory  representation  of  the 
world  based  on  available  sensor  data.  If,  for  some  reason,  the 
motor  schema  manager  fails  to  meet  its  goal  within  a  prescribed 
amount  of  time,  the  pilot  is  reinvoked  and  an  alternate  path 
is  computed  by  the  pilot,  based  on  both  the  LTM  context  and 
STM.  Approximating  polygons  representing  sensed  but  unmod¬ 
eled  (i.e.  unexpected)  objects  are  inserted  into  the  local  ground 
plane  instantiated  meadows  and  the  convex-decomposition  algo¬ 
rithms  (used  by  the  cartographer  to  build  LTM)  are  run  upon 
them.  These  “fractured”  meadows  serve  for  short-term  path  re¬ 
orientation  by  the  pilot  and  the  basis  for  the  instantiation  of  new 
motor  schemas. 

Associated  parameters  for  the  slot-filling  of  motor  schemas 
will  be  provided  by  the  mission  planner,  navigator  and  LTM.  The 
new  commands  issued  by  the  pilot  will  result  in  motor  schema 
instantiation  within  the  motor  schema  manager. 

Typical  motor  schemas  include: 

•  Move-ahead:  Move  in  a  specified  direction.

•  Move-to-goal:  Move  to  an  identifiable  world  feature. 

•  Avoid-static-obstacle:  Avoid  collision  with  unmodeled 
stationary  obstacles. 

•  Stop-when:  Stop  when  a  specified  sensory  event  occurs. 

•  Stay-on-path: Remain on an identifiable path (road,
sidewalk, etc.)

Associated  perceptual  schemas  (run  in  the  context  of  the 
motor  schema  manager)  include: 

•  Find-obstacle: Identify potential obstacles using a particular
sensor strategy.

•  Find-landmark:  Detect  a  specified  landmark  using  sen¬ 
sory  data  (for  managing  the  robot’s  positional  uncertainty). 

•  Find-path: Locate the position of a path on which the
robot is currently situated using a specified sensor strategy.

3.4  Motor  Schema  Manager 

Distributed control for the actual execution of path travel occurs
within the confines of the motor schema manager. Multiple
concurrent schemas are active during the robot’s path traversal
in a coordinated effort to achieve successful path transition. A
potential field methodology [13,14] is used to provide the steering
and velocity commands to the robot. An overall velocity
vector is produced from the individual vector contributions of
each active motor schema. This vector determines the desired
velocity of the robot relative to its environment. When each
motor schema is instantiated, at least one relevant visual algorithm
or perceptual schema is associated with it. Figure 5 shows
a simulation of a field produced by the instantiation of several
independent motor schemas. Additionally, various perceptual
schemas are instantiated to identify available landmarks (as predicted
by long-term memory and the current uncertainty in the
robot’s position). These are used to localize the vehicle without
necessarily evoking motor action. The role of the motor schema
manager, the potential fields representations it uses, and the underlying
motivation for its use are presented in [7].
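The summation of individual schema contributions into one velocity vector can be sketched as follows. This is an illustration only: the closures `move_ahead` and `move_to_goal`, the unit `gain` values and the plain vector addition in `combine` are assumptions for the example, and the field shapes actually used are those described in [7].

```python
import math


def move_ahead(heading_rad: float, gain: float = 1.0):
    """Motor schema producing a constant vector in a fixed direction."""
    def schema(pos):
        return (gain * math.cos(heading_rad), gain * math.sin(heading_rad))
    return schema


def move_to_goal(goal, gain: float = 1.0):
    """Motor schema producing a vector of fixed magnitude toward the goal."""
    def schema(pos):
        dx, dy = goal[0] - pos[0], goal[1] - pos[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            return (0.0, 0.0)
        return (gain * dx / dist, gain * dy / dist)
    return schema


def combine(schemas, pos):
    """Sum the contributions of all active motor schemas at position pos."""
    vx = vy = 0.0
    for schema in schemas:
        sx, sy = schema(pos)
        vx += sx
        vy += sy
    return vx, vy
```

Because each schema is evaluated only at the robot's current position, the whole field never has to be computed during execution; a plot such as Figure 5 simply evaluates `combine` over a grid of positions.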

3.5  Navigation  Scenario 

Perhaps the best way to convey the navigational process
within AuRA is by example. Fig. 4a represents an LTM meadow-map
model of the area outside the Graduate Research Center
at UMASS. Embedded within this map (although not visible
in the figure) is additional data regarding landmarks, building
surfaces, terrain characteristics, etc. This includes specific visual
cues to assist the robot during its path traversal.

Figure 4a. Outdoor Meadow Map
This map represents the area outside the Graduate Research Center
when viewed from above. The detail level of this particular map stored
in LTM is low so that small objects are treated as unmodeled obstacles
for global path planning purposes.

Figure 4b. Global path constructed by navigator
An A* search algorithm is used to search the midpoints and edges of
the bordering passable meadows to arrive at the global path.

Suppose  the  robot  is  given  the  command  to  go  from  its  cur¬ 
rent  position  (outside  the  GRC  low-rise)  to  meet  Professor  X. 
Available  weather  data  indicates  that  the  grassy  regions  are  cur¬ 
rently  impassable  (the  ground  is  muddy  due  to  rain),  and  the 
robot  must  restrict  its  travel  to  the  concrete  sidewalks  or  the 
gravel  path.  The  mission  planner,  recognizing  this,  sets  the 
traversability  factors  for  the  grassy  regions  to  IMPASSABLE, 
locates  the  fact  that  Prof.  X’s  office  is  in  the  East  Engineering 
(EE)  building,  determines  that  he  is  likely  to  be  in  his  office  at 
this  time  (by  referring  to  the  current  time  of  day  and  the  day 
of  week)  and  then  invokes  the  navigator  to  determine  a  path 
from  the  robot’s  current  position  to  the  door  of  the  EE  building. 
We’ll  ignore  the  indoor  navigation  issues  for  the  purposes  of  this 
paper. 

The navigator, based on the instructions from the mission
planner, determines a global path that satisfies these goals using
an A* search algorithm through the meadow boundaries (fig. 4b;
see [6] for the details of how this is accomplished). This path
consists of 5 legs, the individual piecewise linear components of
the path. Let’s look particularly at leg 3, where the robot is
to follow the gravel path (i.e. assume the robot has successfully
traversed the first 2 legs of this path). The pilot receives the
message to travel from point M, representing the center of probability
of the robot's current position, to N, the end of the gravel
path.
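The effect of marking terrain IMPASSABLE before an A* search can be illustrated on a toy graph. This is a sketch under stated assumptions, not the navigator of [6]: the node names, coordinates, the `cost_factor` table and the function `a_star` are hypothetical, nodes merely stand in for points on meadow borders, and traversability factors are assumed to be at least 1 so that straight-line distance remains an admissible heuristic.

```python
import heapq
import math

IMPASSABLE = math.inf


def a_star(nodes, edges, cost_factor, start, goal):
    """Return the cheapest path from start to goal as a list of node names.

    nodes: name -> (x, y); edges: (a, b, terrain) tuples;
    cost_factor: terrain -> multiplier (>= 1, or IMPASSABLE).
    """
    adj = {name: [] for name in nodes}
    for a, b, terrain in edges:
        factor = cost_factor[terrain]
        if factor == IMPASSABLE:
            continue  # this terrain may not be crossed at all
        w = factor * math.dist(nodes[a], nodes[b])
        adj[a].append((b, w))
        adj[b].append((a, w))

    def h(n):
        return math.dist(nodes[n], nodes[goal])

    frontier = [(h(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, g, n, path = heapq.heappop(frontier)
        if n == goal:
            return path
        if g > best.get(n, math.inf):
            continue  # stale queue entry
        for m, w in adj[n]:
            g2 = g + w
            if g2 < best.get(m, math.inf):
                best[m] = g2
                heapq.heappush(frontier, (g2 + h(m), g2, m, path + [m]))
    return None
```

With a direct grass edge between two points and a longer detour over gravel and concrete, the search takes the grass shortcut when grass has factor 1 and the detour once grass is set to IMPASSABLE, which mirrors the muddy-ground case in the scenario.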

The pilot now has available in short-term memory “instantiated
meadows" (i.e. those LTM meadows over which the robot
is expected to pass during this particular leg of the journey, and
several additional visible adjacent meadows, all provided by the
cartographer). From this LTM data, the pilot extracts the following
relevant facts:

1.  Path - The robot is to travel on a gravel path bordered
on either side by grass.

2.  Landmark - At the end of the path, near where the robot
is to turn, is a lamppost.

3.  Landmark - Off to the right of the path appears a bright
red fire hydrant (a readily discernible landmark).

4.  Landmark - To the left of the path, the robot will pass the
GRC tower, a 16 story building (another good landmark).

5.  Obstacles - It is possible (as always) that people, cars or
unmodeled obstacles may be present on the path (either
stationary or moving).

6.  Goal  -  At  the  end  of  this  path  there  is  a  change  in  terrain 
type,  from  gravel  to  concrete. 

1 is useful for a path following strategy, 2 and 6 are useful for
goal recognition, 1, 2, 3, 4 and 6 are useful for localization purposes,
and 5 is necessary for obstacle avoidance.

From  this  information,  the  pilot  determines  that  appropriate 
behaviors  for  this  particular  leg  (travel  across  the  gravel  path) 
include: 

A.  Stay-on-path(find-path(gravel))

B.  Move-ahead(NNE 30 degrees)

C.  Move-to-goal(right(find-landmark(LAMPPOST-107),3))

D.  Move-to-goal(find-transition-zone(gravel,concrete))

E.  Find-landmark(HYDRANT-2)

F.  Find-landmark(GRC-TOWER(face.3))

G.  Avoid-obstacles


Motion  is  first  initiated  by  the  move-ahead  schema,  direct¬ 
ing  the  robot  to  move  in  a  particular  direction  in  global  coordi¬ 
nates,  in  response  to  the  pilot’s  need  to  satisfy  the  navigator’s 
subgoal  to  move  to  point  N.  This  heading  is  based  on  informa¬ 
tion  contained  within  the  spatial  uncertainty  map  that  reflects 
the  uncertainty  in  the  vehicle’s  position  and  orientation  relative 
to  the  world  map  as  well  as  the  specific  direction  of  this  particu¬ 
lar  path  leg.  It  is  not  critical  that  the  heading  be  exactly  correct; 
indeed  significant  error  can  be  tolerated  due  to  the  presence  of 
the  stay-on-path  motor  schema.  As  soon  as  a  move-to-goal 
schema  becomes  active  (due  to  the  recognition  of  the  goal  -  the 
lamppost  and/or  terrain  type  transition  zone),  the  move-ahead 
schema  is  deinstantiated  in  favor  of  it.  Motor  actions  produced 
by  the  move-ahead  schema  and  move-to-goal  schema  are  mu¬ 
tually  exclusive. 

Figure 5. Potential fields produced during leg traversal
The arrows represent the desired velocity vectors that constrain the
robot’s motion.
a) Before the goal is identified, the move-ahead and stay-on-path
SIs conduct the robot on its way. A single obstacle SI is present.
b) After the goal is identified, the move-to-goal SI replaces the
move-ahead SI. Two obstacle SIs are shown as the goal is approached.

Stay-on-path(find-path(gravel)) yields 2 perceptual subschemas
for one motor schema: find-path-border, using a
line-finding algorithm to detect the position of the path's edges,
and segment-path, a perceptual schema that uses region-based
segmentation to locate the spatial extent of the path. Through
the combined efforts of these cooperating schemas the position
of the path relative to the robot is ascertained. As a result of
the posted path position, the stay-on-path motor schema produces
an appropriate velocity vector moving the vehicle towards
the center of the path (fig. 5a).
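A one-dimensional sketch of the stay-on-path response is given below. It assumes the perceptual subschemas have already posted the lateral positions of the two path edges; the function name `stay_on_path`, the linear profile and the saturation at the path border are illustrative assumptions rather than the field defined in [7].

```python
import math


def stay_on_path(left_edge: float, right_edge: float, robot: float,
                 gain: float = 1.0) -> float:
    """Lateral velocity component pushing the robot toward the path center.

    All arguments are lateral coordinates measured across the path.
    The push grows linearly with the offset from the center and
    saturates at `gain` once the robot is at or beyond a path edge.
    """
    center = (left_edge + right_edge) / 2.0
    half_width = (right_edge - left_edge) / 2.0
    offset = (center - robot) / half_width
    magnitude = gain * min(abs(offset), 1.0)
    return math.copysign(magnitude, offset)
```

A robot on the center line receives no lateral push, which is why a moderately wrong move-ahead heading can be tolerated: the lateral correction grows as the robot drifts toward either edge.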

We will define a schema instantiation (SI) to be the activity
of applying a general class of schemas to a specific case [7,33].
The lamppost at the end of the path results in the creation
of a find-landmark schema(s) dedicated to finding LAMPPOST-107
whose model is extracted from LTM via the instantiated
meadows in STM. This find-landmark schema directs
the sensor processing by instantiating a VISIONS perceptual
schema and/or looking for particular strong vertical lines in a
given portion of the image and/or utilizing any other relevant
sensor algorithm. Every time a potential LAMPPOST-107 is
found in the image (perhaps evidenced by a pair of strong parallel
long vertical lines in an appropriate window of the image), a
new LAMPPOST-107 SI is created and monitored independently
of all other similarly created LAMPPOST-107 schema instantiations.
When sufficient supportive data is available confirming
that one of the SIs is highly probable to be the landmark desired,
all other LAMPPOST-107 SIs are deinstantiated (or placed into
hibernation) and the appropriate motor schema (move-to-goal)
starts producing a velocity vector directing the robot to a point
3 feet to the right of the identified lamppost. If the certainty in
the current LAMPPOST-107 drops below a certain threshold,
other SIs may be activated or created in response to particular
visual events that correlate to the lamppost’s model. Additionally,
output from the find-landmark schema is used to update
the robot’s spatial error map, independent of any motor action
that may result from the move-to-goal SI.
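The bookkeeping for competing landmark SIs might look like the following sketch. The `LandmarkSI` class, the additive evidence update and the `confirm` threshold of 0.9 are assumptions chosen for illustration; the paper does not specify how certainty is accumulated.

```python
from dataclasses import dataclass


@dataclass
class LandmarkSI:
    ident: int               # which image event this SI is tracking
    confidence: float = 0.5  # assumed starting certainty
    active: bool = True


def update(sis, evidence, confirm: float = 0.9):
    """Apply new evidence and return the confirmed SI, or None.

    evidence maps an SI ident to a change in confidence. Once one SI
    reaches the confirm threshold, all others are put into hibernation.
    """
    for si in sis:
        if si.active:
            delta = evidence.get(si.ident, 0.0)
            si.confidence = min(1.0, max(0.0, si.confidence + delta))
    winners = [si for si in sis if si.active and si.confidence >= confirm]
    if not winners:
        return None
    best = max(winners, key=lambda si: si.confidence)
    for si in sis:
        if si is not best:
            si.active = False  # hibernate rather than delete
    return best
```

Keeping the losing SIs in hibernation rather than deleting them leaves them available for reactivation if the certainty of the chosen lamppost later drops.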

The move-to-goal(find-transition-zone(gravel,concrete))
SI  is  handled  in  a  similar  manner,  but  different  perceptual  sch¬ 
emas  are  instantiated  and  the  image  is  searched  in  different  re¬ 
gions.  Texture  measures  for  gravel  are  of  value  as  well  as  the 
presence  of  a  strong  horizontal  line  within  the  boundaries  of  the 
path.  The  move-to-goal  schema  contains  an  implicit  stop- 
when  schema,  so  when  the  target  is  reached  the  pilot  is  notified 
that  the  goal  has  been  achieved  and  the  next  leg  can  be  under¬ 
taken. 

The find-landmark(HYDRANT-2) schema might involve a
color-based  segmentation,  tagging  all  bright  red  blobs  in  a  partic¬ 
ular  portion  of  the  image  as  a  potential  fire-hydrant.  Ultimately 
size  and  shape  from  a  model  of  the  hydrant  would  be  brought  into 
focus to confirm the hypothesis to prevent incorrect identifications
(e.g. a red car, or a person with a red coat). Once identified,
this  hydrant  is  then  used  to  reduce  the  uncertainty  in  the  robot’s 
position  (i.e.  localization).  The  same  kind  of  operation  would  be 
involved  in  find-landmark(GRC-TOWER(face.3))  but  instead 
of  using  color  as  the  primary  agent  for  hypothesis  formation,  a 
strong  vertical  line  (the  building  is  16  stories  high!)  or  a  corner 
silhouetted  against  the  sky  would  be  more  suitable  as  the  main 
strategy. 
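The color-based hypothesis step for the hydrant can be sketched as thresholding followed by connected-component grouping. Everything specific here is an assumption made for the example: the RGB thresholds inside `is_red`, the 4-connected grouping, the `min_pixels` filter and the function name `red_blobs`. It is not the segmentation algorithm used by VISIONS.

```python
def red_blobs(image, min_pixels: int = 2):
    """Tag bright red regions in an image as candidate hydrants.

    image: list of rows, each a list of (r, g, b) tuples in 0..255.
    Returns a list of blobs, each a set of (row, col) pixel positions.
    """
    def is_red(pixel):
        r, g, b = pixel
        return r > 180 and g < 80 and b < 80  # assumed thresholds

    rows, cols = len(image), len(image[0])
    seen = set()
    blobs = []
    for r0 in range(rows):
        for c0 in range(cols):
            if (r0, c0) in seen or not is_red(image[r0][c0]):
                continue
            # Flood fill over 4-connected red neighbours.
            stack = [(r0, c0)]
            seen.add((r0, c0))
            blob = set()
            while stack:
                r, c = stack.pop()
                blob.add((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and is_red(image[nr][nc])):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            if len(blob) >= min_pixels:
                blobs.append(blob)
    return blobs
```

Each returned blob is only a hypothesis; size and shape checks against the hydrant model would still be needed to reject a red car or a red coat.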

The avoid-obstacles schema is actually active most of the
time. Simply put, the image is windowed in the direction of
the robot’s motion and if any unusual events occur in that area
(e.g. change in texture, color, strong line, etc.) an obstacle SI
is associated with that particular event. That portion of the
image is monitored over time by the obstacle perceptual SI to
try to confirm or disprove the hypothesis that the visual event
is truly an obstacle. Concurrent with the instantiation of the
obstacle perceptual schema is the instantiation of an
avoid-obstacle motor schema. If the monitored obstacle's certainty
becomes sufficiently high and the robot enters within the sphere
of influence of the obstacle, then a repulsive velocity field is produced
by the avoid-obstacle SI, altering the robot’s course. If,
on the other hand, the hypothesised obstacle eventually is determined
to be a phantom and not a real obstacle at all, both the
perceptual and motor schemas are deinstantiated. When an active
obstacle passes outside of the influence of the vehicle, its SIs
are deinstantiated as well. Nonetheless, information about the
obstacle’s position is maintained in STM by the cartographer at
least for the duration of the leg traversal.
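The two conditions for a repulsive response, sufficient certainty and entry into the sphere of influence, can be sketched as below. The function `avoid_obstacle`, the linear magnitude profile and the default values for `influence`, `radius` and `min_certainty` are assumptions for illustration; the actual field equations are given in [7].

```python
import math


def avoid_obstacle(robot, obstacle, certainty: float,
                   influence: float = 3.0, radius: float = 0.5,
                   gain: float = 1.0, min_certainty: float = 0.7):
    """Repulsive velocity vector from one hypothesised obstacle.

    Returns (0, 0) while the obstacle is still an unconfirmed hypothesis
    or while the robot is outside its sphere of influence.
    """
    dx, dy = robot[0] - obstacle[0], robot[1] - obstacle[1]
    dist = math.hypot(dx, dy)
    if certainty < min_certainty or dist >= influence or dist == 0.0:
        return (0.0, 0.0)
    # Magnitude rises linearly from 0 at the edge of the sphere of
    # influence to `gain` at the obstacle's surface.
    magnitude = gain * min(1.0, (influence - dist) / (influence - radius))
    return (magnitude * dx / dist, magnitude * dy / dist)
```

The vector always points from the obstacle toward the robot, so adding it to the other motor schema outputs bends the robot's course away from the obstacle.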

Figure  5  shows  a  potential  field  simulation  representative  of 
the  robot  traversing  an  obstacle  studded  path  as  above.  More 
details  regarding  the  interaction  and  operation  of  the  motor 
schemas in AuRA can be found in [7].

4.  VISIONS  and  AuRA 

Scene  interpretation  has  long  been  a  primary  research  effort 
within  the  VISIONS  group  at  the  University  of  Massachusetts. 
A  considerable  literature  exists  describing  the  progress  to  date 
[1,2,23,24,29,31]. The remainder of this section will first describe
briefly  the  operation  of  the  schema  system,  followed  by  the  role 
that  the  schema  system  can  play  in  mobile  robot  navigation.  It 
should  be  understood  from  the  onset  that  schema-based  scene 
interpretation  is  currently  a  very  time-consuming  and  compu¬ 
tationally  expensive  process.  Work  is  underway,  however,  to 
provide  parallel  hardware  (the  UMASS  Image  Understanding 
Architecture  [29,30])  to  speed  up  this  process  by  several  orders 
of  magnitude.  Additionally,  available  a  priori  knowledge  present 
in  LTM  can  be  used  to  guide  schema  instantiation  and  reduce 
the  processing  requirements  dramatically. 

4.1  The  Schema  System 

The  VISIONS  schema  system  accepts  an  image  as  input  and 
produces  a  labeled  interpretation  of  the  observed  environmental 
objects  (fig.  6)  and,  to  the  degree  possible,  a  3-D  representation 
of the environment. There are three levels of processing available,
utilizing both bottom-up and top-down processing (fig. 7). Taking
a  bottom-up  view  first,  the  low-level  processes  operate  on  pixel 
level  data  producing  an  intermediate  symbolic  representation  of 
lines,  surface  and  volume  tokens.  At  the  highest  level,  schema 
processes  exist  which  interpret  and  collect  the  intermediate  rep¬ 
resentations  into  labeled  objects. 

If no top-down guidance were available, it would be virtually
impossible  for  the  system  to  converge  on  an  acceptable  interpre¬ 
tation.  Perceptual  schemas  (in  the  context  of  VISIONS)  post 
hypotheses  about  what  specific  image  events  mean.  Each  highly 
rated  hypothesis  guides  intermediate  and  low-level  processes  in 
an  effort  to  find  self-supporting  evidence.  This  top-down  guid¬ 
ance  brings  the  intermediate  and  low-level  processing  require- 
ments  down  to  tolerable  levels.  If  the  hypothesis  cannot  find 
sufficient  support  or  is  contradicted  by  other  data,  it  is  deinstan¬ 
tiated.  On  the  other  hand,  if  sufficient  support  for  a  hypothesis 
is  available,  that  particular  portion  of  the  image  will  be  labeled 
as  being  associated  with  a  particular  environmental  object  and 
inference mechanisms can direct further semantic processing.

It  is  quite  difficult  to  describe  the  operation  of  the  schema 
system  in  a  few  paragraphs.  It  is  hoped  that  the  interested 
reader will refer to the more comprehensive descriptions cited
above  [esp.  29,31]  for  a  better  understanding  of  its  operation. 

4.2  Utilization  of  VISIONS  Schemas  in  Mobile 
Robotics 

The  principal  test  domains  to  date  for  VISIONS  schema- 
based  scene  interpretation  have  been  house  scenes  and  road  scen¬ 
es  (fig.  6).  These  efforts  have  been  predominantly  concerned  with 
full  scene  labelings  with  few  specific  expectations  established  for 
the  particular  image  or  environment  in  question  other  than  it 
being  a  house  or  road  scene  (i.e.  there  is  no  world  map  of  the 
domain). 


If a priori knowledge of a specific environment is available,
it can guide the posting of schema hypotheses, reducing the
amount of computing time required to achieve a satisfactory
labeling. If the robot's position is approximately known within
a global map, this information regarding the potential position
of environmental objects (e.g. landmarks or roads) can be used
to restrict the formation of object hypotheses to particular
portions of the image. The occurrence of two known objects in
predicted positions relative to each other can significantly increase
the plausibility of a proposed interpretation.

Where can schema-based scene interpretation be used in mobile
robotics? In the most grandiose sense, one can say for everything.
If a completely and correctly labeled image is available, it
can be used for navigation, obstacle avoidance, localization, goal
recognition, etc. Indeed the other algorithms described in this
paper (line finding, region extraction, etc.) actually constitute
some of the lower level processes used within the VISIONS system.
Being realistic however, one must recognize that real-time
responses are necessary for mobile robot navigation, indicating
that schema-based scene interpretation is presently too slow to
be effective. A more appropriate current use of the VISIONS
schema system would be to provide for the top-down extraction
of semantic objects of interest, required for several of the other
visual processes. If the initial image is analyzed by the scene
interpretation mechanisms, it could yield the road edges that
can be used to bootstrap the stay-on-path motor schema and
find-path perceptual schema. Additionally it could provide the
initial region statistics to seed the region extraction algorithm
for path-following and landmark or goal recognition. Start-up
information for the depth from motion algorithm could be provided
as well, in addition to potential corners that are of use
for localization purposes by the interest operator. Finally, if the
robot becomes sufficiently disoriented relative to its global map,
the schema interpretation system could be reinvoked to enable
the robot to regain its bearings relative to the modeled world.

5. Modular Vision Algorithms

Although nothing in AuRA restricts sensor processing to be
predominantly visual, much of the architecture is constructed
to utilize this form of sensing. Action-oriented perception is the
fundamental premise on which motor schema sensing and navigation
is based. It is not necessary for the robot to fully understand
the entire scene before navigation can be initiated (although this
would certainly make things easier). Instead, by directing specific
sensing strategies and the available computational resources
to the motor needs of a particular task, only those portions of
the scene which can contribute to the attainment of the pilot's
goals are analyzed. Particular sensor algorithms are chosen to
fit the demands of the specific path leg at hand.


No  single  perception  algorithm  is  a  panacea  for  navigation. 
The  designer’s  goal  instead  is  the  development  of  a  wealth  of  vi¬ 
sual  and  other  sensing  algorithms  which  can  provide  the  breadth 
that  multi-domain  navigation  requires.  AuRA  should  be  able  to 
provide  navigational  capabilities  in  both  indoor  and  outdoor  en¬ 
vironments,  allowing  for  considerable  environmental  diversity  in 
each  of  these  cases. 

Computationally  efficient  vision  algorithms  are  used  to  pro¬ 
vide  navigational  information  for  the  robot  and  are  initially  im¬ 
plemented  on  a  single  processor.  The  next  stage  will  be  to  ded¬ 
icate  specific  processors  for  each  algorithm  to  improve  perfor¬ 
mance  and  then  to  eventually  distribute  the  load  over  parallel 
hardware. 

From an experimental point of view, this architecture affords
the flexibility to try new perceptual strategies without forcing
significant changes in the supporting system components. By
embedding motor actions as behaviors and perceptual strategies
as focus of attention mechanisms, both represented in a schema
form, the addition, modification and deletion of these program
units is manageable. The emphasis is on modularity. New world
representations can be embedded in LTM through the use of the
feature editor, providing for extensions that may be needed by
new algorithms.

Figure 7. Multiple Levels of Representation and Processing in VISIONS
(panel labels: inference and propagation of belief; perceptual
organization: grouping, splitting, and deleting ISR tokens; stereo and
motion to produce depth arrays)

A  common  thread  running  through  many  of  the  algorithms  is 
their  ability  to  be  decomposed  into  two  phases:  start-up  (boot¬ 
strap)  and  update  (feedforward).  The  start-up  phase  performs 
more slowly and has little, if any, a priori knowledge to work
from. The start-up process produces initial region seed statistics,
depth information, line orientation, etc., which can be used
to  advantage  in  subsequent  frame  analysis.  The  update  stage 
uses  the  information  provided  from  the  start-up  phase  to  re¬ 
strict  the  possible  interpretation  of  image  events  and  limit  the 
search  area  for  those  events,  thus  reducing  processing  time  sig¬ 
nificantly.  The  initial  output  of  the  start-up  phase  is  updated 
after  each  processing  run  and  is  fed  forward  to  provide  a  basis 
to  guide  analysis  of  the  next  image. 
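
The two-phase decomposition above can be sketched generically. Here `run_two_phase`, `find_peak_full` and `find_peak_near` are hypothetical names, and the one-dimensional peak tracker merely stands in for a real vision algorithm: start-up searches the whole signal, while update searches only a small window around the previous result.

```python
def run_two_phase(frames, start_up, update):
    """Process a frame sequence: bootstrap on the first frame, then feed forward."""
    results = []
    model = None
    for frame in frames:
        if model is None:
            model = start_up(frame)        # slow, little or no prior knowledge
        else:
            model = update(frame, model)   # fast, constrained by previous result
        results.append(model)
    return results

def find_peak_full(signal):
    """Start-up: search the entire signal for its maximum."""
    return max(range(len(signal)), key=lambda i: signal[i])

def find_peak_near(signal, prev, window=2):
    """Update: search only a small window around the previous peak position."""
    lo = max(0, prev - window)
    hi = min(len(signal), prev + window + 1)
    return max(range(lo, hi), key=lambda i: signal[i])
```

The update step examines at most `2 * window + 1` samples regardless of signal length, which is the source of the processing-time reduction described in the text.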

The  remainder  of  this  section  will  discuss  some  of  the  sen¬ 
sor  algorithms  that  exist  or  are  being  developed  for  use  within 
AuRA. The emphasis will be on visual processing, but some
work  already  has  been  accomplished  using  ultrasonic  data  as 
well.  The  imaginative  reader  will  undoubtedly  think  of  other 
approaches  and  other  sensors  that  can  be  used  within  a  system 
such  as  AuRA.  The  strategies  described  below  are  not  exhaus¬ 
tive,  but  rather  they  represent  the  current  initial  elements  being 
introduced by VISIONS researchers for use within this framework.

5.1  Line  Extraction 

Line  extraction  has  the  potential  for  multiple  uses  within 
AuRA. These include path edge extraction for use by stay-on-path
schemas, landmark identification for find-landmark
schemas, and as a texture measure for terrain identification.
Of  these,  the  first  two  are  currently  being  developed  for  use 
in  AuRA.  The  remainder  of  this  section  will  first  describe  the 
fast  line  finding  algorithm,  and  then  its  application  to  both  path 
following  and  localization  purposes. 

5.1.1  Fast  Line  Finder  (FLF) 

A fast line finder based on Burns' algorithm [15] has been
developed by Kahn, Kitchen and Riseman. It is a two-pass
algorithm which first groups the image data into edge-support
regions based upon coarse quantization buckets of gradient
orientation. This grouping process collects pixels of similar
gradient orientation into separate regions via a connected-components
algorithm. The gradient magnitude does not affect the line
extraction process. A line is then fitted to the resultant edge-support
region. FLF differs from the original Burns approach in
simplifying the specification of the gradient orientation buckets
and the extraction of the representative line for each edge-support
region. Many of the elementary computations can be sped
up further through the use of a conventional pipeline processor
which supports a look-up table and convolution processing.
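
The two passes can be illustrated with a small pure-Python sketch. This is not the FLF implementation: the central-difference gradient, the eight non-overlapping buckets, 4-connectivity and the principal-axis line fit are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
from collections import deque

def gradient_orientation_buckets(image, n_buckets=8, offset=0.0):
    """Label each interior pixel with a coarse gradient-orientation bucket."""
    h, w = len(image), len(image[0])
    labels = [[-1] * w for _ in range(h)]
    width = 2 * math.pi / n_buckets
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            if gx == 0 and gy == 0:
                continue                   # no gradient: pixel supports no edge
            angle = (math.atan2(gy, gx) - offset) % (2 * math.pi)
            labels[y][x] = int(angle / width) % n_buckets
    return labels

def edge_support_regions(labels, min_size=3):
    """Group 4-connected pixels sharing a bucket label into regions."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if labels[y][x] < 0 or seen[y][x]:
                continue
            bucket, queue, pixels = labels[y][x], deque([(x, y)]), []
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and not seen[ny][nx] \
                            and labels[ny][nx] == bucket:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) >= min_size:
                regions.append(pixels)
    return regions

def fit_line(pixels):
    """Fit a line to a region: return (centroid, orientation in radians, length)."""
    n = len(pixels)
    mx = sum(p[0] for p in pixels) / n
    my = sum(p[1] for p in pixels) / n
    sxx = sum((p[0] - mx) ** 2 for p in pixels)
    syy = sum((p[1] - my) ** 2 for p in pixels)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pixels)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # principal-axis direction
    proj = [(p[0] - mx) * math.cos(theta) + (p[1] - my) * math.sin(theta)
            for p in pixels]
    return (mx, my), theta, max(proj) - min(proj)
```

Consistent with the text, gradient magnitude plays no role in the grouping beyond discarding pixels whose gradient is exactly zero.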

Fragmentation of a potentially long image line often occurs
if no a priori knowledge is available regarding the approximate
orientation of the line in the image. The likelihood of extracting
a particular long line increases when a bucket's orientation is
tuned to be centered on the anticipated orientation of a road edge
or other line model in the image, through the use of available
knowledge extracted from LTM or previous images.
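
A short sketch shows why tuning helps. With an untuned layout, gradient angles scattered around a bucket boundary fall into two different buckets and the line fragments; centering a bucket on the anticipated orientation keeps them together. The function name `bucket_index` and the eight-bucket layout are assumptions for illustration only.

```python
import math

def bucket_index(angle, n_buckets=8, center=0.0):
    """Bucket index for a gradient angle, with bucket 0 centered on `center`."""
    width = 2 * math.pi / n_buckets
    shifted = (angle - center + width / 2) % (2 * math.pi)
    return int(shifted / width) % n_buckets

# Gradient angles of one noisy edge, scattered around 22.5 degrees.
expected = math.pi / 8
angles = [expected - 0.05, expected + 0.05]

untuned = {bucket_index(a) for a in angles}                 # straddles a boundary
tuned = {bucket_index(a, center=expected) for a in angles}  # one bucket
```

Here `untuned` contains two bucket indices while `tuned` contains one, so only the tuned layout groups both pixels into a single edge-support region.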

A key concept is action-oriented perception: performing only
that computation which is necessary for the specific task at hand.
Features available within the FLF algorithm to support this concept
are described in the remainder of this paragraph. These
features include the ability to scope the image (i.e. perform line
extraction on a subwindow of the image). If the robot has an
approximate knowledge of the position of the line feature being
sought (derived from LTM, the spatial error map and/or
previous images), substantial processing reductions can be attained
by ignoring those portions of the image where the feature
is unlikely to occur. In addition to orientation, the FLF can
be adjusted to filter lines based on gradient magnitude, dispersion,
size of the region, and length. By adjusting these filters
in advance, based on the features desired (e.g. short lines for
texture, or long lines for roads), unnecessary processing can be
minimized. A secondary filtering procedure is also available for
removing lines after the fast line finder has been run, making it
possible to collect different sets of lines with different characteristics
with only a single run of the more time-consuming FLF.
This is possible because when the lines are produced, statistics
regarding each line are collected and stored with the endpoint
data for later reference.
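
The secondary filtering step can be sketched as a selection over stored line records. The record layout (a dictionary with `length` and `theta` keys) and the function name `filter_lines` are hypothetical; the source states only that statistics are stored with the endpoint data.

```python
import math

def filter_lines(lines, min_length=0.0, orientation=None,
                 tolerance=math.radians(10)):
    """Select stored line records by length and, optionally, orientation."""
    selected = []
    for line in lines:
        if line["length"] < min_length:
            continue
        if orientation is not None:
            # Lines are undirected, so compare orientations modulo pi.
            diff = abs(line["theta"] - orientation) % math.pi
            if min(diff, math.pi - diff) > tolerance:
                continue
        selected.append(line)
    return selected

# One FLF run, several differently filtered line sets.
lines = [
    {"length": 40.0, "theta": math.pi / 2},   # long vertical line
    {"length": 5.0, "theta": 0.1},            # short fragment (texture)
    {"length": 30.0, "theta": 0.0},           # long horizontal line
]
long_lines = filter_lines(lines, min_length=20.0)
long_vertical = filter_lines(lines, min_length=20.0, orientation=math.pi / 2)
```

Because the filters read only the stored statistics, each additional line set costs a pass over the records rather than another run of the line finder.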

Figure 8a is an image of a sidewalk scene. Figure 8b shows
the results of the FLF using the default bucket orientation for the
entire image, fig. 8c shows the results with the buckets tuned
and scoped to the anticipated road edge based upon the internal
model of the vehicle position and orientation, and fig. 8d shows
the results with the buckets tuned to horizontal and vertical
edges, filtering to retain longer lines, and with the image scoped
above the horizon.

5.1.2  Path  Following 

Using line following to extract path boundaries requires the
grouping of resultant FLF line fragments into a single line
representing each path edge. No effort is being made to condition or
modify existing paths to make this process easier (e.g. by adding
stripes, cleaning, etc.). The grouping strategy used must be able
to deal with fragmentation and edge discontinuities, such as path
intersections, leaves, etc.

If the uncertainty of the vehicle is within reasonable limits,
predictions of the position and orientation of the road lines in the
image plane can be made. As described above, there are two distinct
components of road-following (see also [4]): the bootstrap
or start-up phase, where the road edge is determined in the image
for the first time; and the feedforward or update phase,
where a previous image is used to guide the processing for the
next image. Line finding is not necessarily the best strategy
for initially finding the road's position. Nonetheless it can be
reasonably effective if the road appears on a global map of the
terrain and there is approximate information about the vehicle's
position and orientation. These are both present within AuRA,
in LTM and the spatial error map respectively.

The feedforward phase assumes that the approximate position
of the road was known in the last image. This information,
when coupled with the commanded translation and rotation the
robot has undertaken since the last image acquisition, can be
used to predict where and at what orientation the road edges
will occur in the newly acquired image. As anyone who has
worked with mobile robots knows, the motion that a robot actually
takes may differ quite significantly from that which it was
commanded to perform. Consequently, there must be a considerable
margin for error in these predictions if the algorithm is
expected to be robust. Additionally, there must be some measure
of the confidence in the line produced representing the road
edge.

Path  edge  grouping  proceeds  as  follows:  The  buckets  are 
tuned  based  on  the  anticipated  position  of  the  road  edge  in  feed¬ 
forward  mode;  in  bootstrap  mode  either  LTM  or  the  default 
buckets  would  be  used.  The  fast  line  finder  is  then  run,  produc¬ 
ing  line  fragments  in  the  approximate  orientation  of  the  path 
edge  (fig.  9a).  These  fragments  are  then  filtered  based  on  their 
distance from the anticipated image line and the expected orientation
of either the right or left edge. Again the amount of
tolerance  allowed  is  controllable.  This  yields  two  sets  of  line 
fragments  (one  for  each  path  edge  -  fig.  9b-c).  All  the  fragments 
above  the  vanishing  point  of  the  road,  (obtained  from  feedfor¬ 
ward  information),  are  discarded.  The  center  of  mass  of  the 
midpoints  of  remaining  line  fragments  is  computed,  each  mid¬ 
point  weighted  by  the  length  of  the  fragments  themselves.  The 
average  orientation  is  computed  in  a  similar  manner.  The  result¬ 
ing  point  on  the  line  and  computed  line  orientation  determine 
the  line  equation  for  each  road  edge.  The  left  and  right  edges 
are  then  used  to  compute  the  road  centerline  (fig.  9d).  The  cen¬ 
terline  is  the  basis  for  determining  the  rotational  deviation  of 
the  vehicle  relative  to  the  road’s  vanishing  point  as  well  as  the 
translational deviation from the road centerline. These newly
computed path edges are then used as the models for the next
feedforward step.

Figure 8. Fast Line Finding
a) Original sidewalk image.
b) Default bucket orientation.
c) Buckets tuned to road edges.
d) Buckets tuned to long vertical and horizontal lines above horizon.

Figure 9. Path finding using FLF
a) Output of FLF when run on image [...] (tuned buckets - same [...]).
b) Fragments left after filtering and windowing for left path edge.
c) Fragments left after filtering and windowing for right path edge.
d) Resultant path edges and computed road centerline.
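
The grouping of filtered fragments into a single path edge, and the derivation of the centerline from the two edges, can be sketched as follows. The length-weighted midpoint follows the text; averaging the doubled angle (so that a fragment and its reverse count as the same orientation) and taking the centerline as the midpoint between the edges along an image row are assumptions, as are all function names.

```python
import math

def group_fragments(fragments):
    """Merge line fragments into one line.

    Each fragment is ((x1, y1), (x2, y2)). Returns (point, theta, total_length),
    where point is the length-weighted center of mass of the midpoints.
    """
    total = sx = sy = c2 = s2 = 0.0
    for (x1, y1), (x2, y2) in fragments:
        length = math.hypot(x2 - x1, y2 - y1)
        theta = math.atan2(y2 - y1, x2 - x1)
        total += length
        sx += length * (x1 + x2) / 2
        sy += length * (y1 + y2) / 2
        # Average the doubled angle so theta and theta + pi count as equal.
        c2 += length * math.cos(2 * theta)
        s2 += length * math.sin(2 * theta)
    return (sx / total, sy / total), 0.5 * math.atan2(s2, c2), total

def x_at_row(point, theta, row):
    """x coordinate where the line crosses an image row (line must not be horizontal)."""
    return point[0] + (row - point[1]) * math.cos(theta) / math.sin(theta)

def centerline_x(left, right, row):
    """Midpoint between the left and right path edges along one image row."""
    return 0.5 * (x_at_row(left[0], left[1], row) +
                  x_at_row(right[0], right[1], row))
```

The third value returned by `group_fragments`, the total fragment length, is the quantity the following paragraph uses to judge how well supported the extracted edge is.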

The total length of the line fragments used in producing the
path edges serves as a measure of confidence. If this value
drops below a specified threshold, special processing is undertaken.
This includes increasing the error tolerances and margins
in the FLF to see if a more confident line can be extracted from
the same image, or if that fails, digitizing another image in the
event that a passing obstacle blocked one or both path edges.
If both of these strategies fail, the robot will reposition itself
slightly and try another image. If this still fails, alternate
bootstrapping methods must be brought to bear.
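
The escalation sequence in the preceding paragraph can be written as a small control loop. The callables `grab_image`, `find_edges` and `reposition`, and the tolerance values, are hypothetical stand-ins for the digitizer, the FLF-based edge extractor and the vehicle controller.

```python
def extract_confident_edges(grab_image, find_edges, reposition,
                            min_total_length, tolerances=(1.0, 2.0)):
    """Escalate through the recovery steps until edge support is long enough.

    find_edges(image, tolerance) must return (edges, total_fragment_length).
    Returns (edges, step_name), or (None, "bootstrap") when every step fails.
    """
    def try_image(image):
        for tol in tolerances:             # normal, then widened tolerances
            edges, total_length = find_edges(image, tol)
            if total_length >= min_total_length:
                return edges
        return None

    for step in ("first image", "new image", "repositioned"):
        if step == "repositioned":
            reposition()                   # move slightly before the last try
        edges = try_image(grab_image())
        if edges is not None:
            return edges, step
    return None, "bootstrap"               # fall back to alternate bootstrapping
```

Returning the step name makes it easy for a caller to log how far the recovery had to escalate.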

The robot has already been able to successfully navigate both
an outdoor sidewalk and an indoor hall using the FLF. Approximately
1M CPU seconds (VAX-750) are required for each step to
provide the robot information for traveling 5.0 feet ahead. The
512 by 512 image digitized on a Gould IP8500 is averaged to 256
by 256 before line extraction. This time will be reduced by using
pipelined hardware available on the digitizer. The vehicle servos
on the computed centerline position, correcting both orientation
and translational drift as it proceeds.

5.1.3  Landmark  Identification  through  Line  Finding 

Vehicle localization can be addressed by the line finding algorithm
using data stored in LTM. Localization is simply orienting
the vehicle relative to its global map; in other words, getting its
bearings. We do not propose that lines are the only mechanism
for localization; rather, they should serve in conjunction with other
relevant algorithms. In the role of a confirmation mechanism, or
for tracking a previously identified landmark feature from frame
to frame, FLF localization is well suited. Extracting the edges of a
path as described above also provides information for localizing
the vehicle, assuming the path is represented in the world
map.

Figure 10. Road extraction via region segmentation
a) Sidewalk image.
b) Resultant extracted image representing road.
c) Resultant extracted image representing central portion of sky.

Long,  strong  vertical  lines  and  corners  derived  from  such  lines 
are  probably  the  most  appropriate  general  category  of  lines  that 
is  suitable  for  this  application.  Edges  of  buildings,  telephone 
poles,  lampposts  or  doorframes  can  be  tracked  using  the  line 
finder.  Figure  8d  shows  the  result  of  running  the  FLF  on  image 
fig.  8a  with  the  buckets  and  filters  tuned  for  long  horizontal  and 
vertical  lines  of  high  gradient  magnitude.  This  orientation  can 
be  used  to  identify  the  roofs  of  buildings  against  the  sky  or  road 
intersections  directly  in  front  of  the  vehicle.  By  windowing  the 
image  for  a  certain  landmark  based  on  the  position  of  the  vehicle 
as  indicated  by  the  spatial  error  map  and  a  priori  knowledge  of 
the  global  coordinates  and  dimensions  of  the  feature  in  question 
(from  LTM),  it  becomes  possible  to  isolate  features  such  as  the 
corner  of  a  building  by  combining  the  evidence  from  both  hori¬ 
zontal  and  vertical  lines.  This  then  can  be  used  to  constrain  the 
positional  error  of  the  vehicle  by  backprojecting  the  2-D  data  to 
3-D  world  coordinates  when  combined  with  the  knowledge  of  the 
height  of  the  feature. 
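
A minimal sketch of the backprojection step, assuming a pinhole camera that is level with the ground and a feature that stands upright; neither assumption, nor the parameter names, comes from the source. Given the known height of the feature and its extent in image rows, similar triangles give its range, and the column offset then gives its lateral position.

```python
def backproject_landmark(u_top, v_top, v_bottom, height_m, focal_px, cx):
    """Estimate landmark position in the camera frame from its known height.

    u_top, v_top: image column and row of the feature's top;
    v_bottom: image row of its base; focal_px: focal length in pixels;
    cx: image column of the optical axis.
    Returns (lateral offset X, range Z) in meters.
    """
    pixel_height = abs(v_bottom - v_top)
    if pixel_height == 0:
        raise ValueError("feature has zero image height")
    z = focal_px * height_m / pixel_height   # similar triangles
    x = (u_top - cx) * z / focal_px
    return x, z
```

For example, a 5 m feature spanning 100 image rows with a 500-pixel focal length lies 25 m away; comparing that estimate with the map position from LTM is what constrains the vehicle's positional error.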

5.2 Fast Region Segmenter (FRS)

FRS  is  a  region  extraction  algorithm  developed  in  a  manner 
akin  to  the  fast  line  finder,  but  based  upon  similarity  of  color 
and  intensity  features.  It  functions  by  first  defining  a  look-up 
table  that  is  used  for  classifying  an  input  image.  The  input  im¬ 
age  used  can  be  an  intensity  image,  a  gradient  image,  a  color 
plane,  etc.  This  input  image  can  be  scoped  (windowed)  as  is 
the  case  with  the  FLF.  The  look-up  table  maps  ranges  of  pixel 
values  to  specific  region  labels  and,  as  before,  can  be  loaded  in  a 
top-down  manner  based  upon  stored  knowledge  or  sampled  data 
from  previous  images.  Available  knowledge  can  be  used  to  de¬ 
fine  expected  ranges  of  spectral  attributes  of  interesting  objects 
(fig.  7,10,11).  The  resulting  classified  image  is  then  subjected  to 
a  region  extraction  algorithm  which  groups  the  classified  pixels 
into  regions.  Statistical  data  is  then  collected  regarding  each 
region. This algorithm has been motivated by histogram-based
segmentation algorithms [34], but achieves great simplification
via  constraints  from  stored  object  knowledge  in  LTM  or  the  re¬ 
sult  of  processing  previous  frames,  to  define  the  look-up  table 
ranges  for  objects  in  the  next  frame. 

The  speed  of  this  algorithm  arises  from  the  use  of  the  look-up 
table  to  provide  a  quick  mapping  to  the  image.  The  connected 
components  routine  is  then  run  on  a  restricted  portion  of  the 
image  selected  through  the  use  of  top-down  map  constraints. 
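
The classification and region extraction steps can be sketched in pure Python. The inclusive value ranges, 4-connectivity, the choice of statistics and all names are assumptions for illustration; the source specifies only that a look-up table classifies pixels and that a connected-components pass then runs on a restricted window.

```python
from collections import deque

def build_lut(ranges, size=256):
    """Map pixel values to region labels; `ranges` is {label: (low, high)} inclusive."""
    lut = [0] * size                       # label 0 means "unclassified"
    for label, (low, high) in ranges.items():
        for value in range(low, high + 1):
            lut[value] = label
    return lut

def segment(image, lut, window=None):
    """Classify pixels through the LUT, then extract 4-connected regions.

    `window` is (x0, y0, x1, y1) with exclusive upper bounds; default is the
    full image. Returns dicts with label, size, centroid and mean pixel value.
    """
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = window or (0, 0, w, h)
    seen = set()
    regions = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            label = lut[image[y][x]]
            if label == 0 or (x, y) in seen:
                continue
            seen.add((x, y))
            queue, pixels = deque([(x, y)]), []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if x0 <= nx < x1 and y0 <= ny < y1 and (nx, ny) not in seen \
                            and lut[image[ny][nx]] == label:
                        seen.add((nx, ny))
                        queue.append((nx, ny))
            n = len(pixels)
            regions.append({
                "label": label,
                "size": n,
                "centroid": (sum(p[0] for p in pixels) / n,
                             sum(p[1] for p in pixels) / n),
                "mean": sum(image[py][px] for px, py in pixels) / n,
            })
    return regions
```

Passing a `window` corresponds to scoping the image with top-down map constraints, so the connected-components pass never visits pixels outside the expected location of the object.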

This  segmentation  can  be  used  for  road  extraction  as  in 
[16,17]. Preliminary experimentation using intensity images can
be seen in figure 10. Fig. 10a shows the original image and
fig. 10b the region extracted representing the road. The statistics
collected for the road region are then used for providing the
expectations (feedforward) for the next image in the sequence
(as in [16]).
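
One simple way to turn region statistics into expectations for the next frame is to set the look-up table range to the mean plus or minus a few standard deviations. This rule, the factor `k` and the function name are assumptions; the source does not state how the statistics are converted into ranges.

```python
import math

def next_frame_range(values, k=2.0, low=0, high=255):
    """Derive the LUT range for the next frame as mean +/- k standard deviations."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return (max(low, int(math.floor(mean - k * std))),
            min(high, int(math.ceil(mean + k * std))))
```

The resulting range is clamped to the valid pixel interval, so a bright region near saturation still yields a usable table entry.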

Landmark  extraction  can  be  handled  similarly.  The  centroid 
of  the  landmark  can  be  used  for  localization  purposes,  in  contrast 
to  the  edge  detection  methods  used  by  the  FLF  or  the  corner 
detection  approach  used  with  the  Moravec  operator  described 
below. A bright yellow road sign (very dark in the blue sensory
band)  is  segmented  for  localization  purposes  in  figure  11. 

5.3  Depth  from  Motion 

Passive navigation by the determination of the position of
environmental objects through vision is an important sensor strategy
for AuRA. The motion research group within VISIONS has
long explored the extraction of depth from motion [18,19,20].
A more recent algorithm, developed by Bharwani, Riseman and
Hanson, uses a sequence of frames to incrementally refine positional
estimates of objects over time. It can be used in mobile
robotics for obstacle avoidance, position localization, and as
evidence in object identification.

Figure 11. Landmark identification via region segmentation
a) Original sign image (combined RGB intensity).
b) Blue plane of (a) chosen for analysis due to the spectral
data of anticipated landmark (available from LTM).
c) Extracted region representing sign.

5.3.1 Algorithm

A brief sketch of the multiple frame depth-from-motion algorithm
developed by Bharwani, Riseman, and Hanson follows.
The reader is referred to [21,28] for the details of this approach.
The algorithm allows refinement of depth over time up to some
detectable limit, while maintaining a constant computational
rate. This is very important for real-time processing.

The problem of recovering depth from motion in a sequence of
images again involves the decomposition of the problem into two
components: start-up and updating. This algorithm makes the
assumption that the camera is undergoing pure translational motion
and that the position of the focus of expansion (FOE) is known
within some reasonable degree of accuracy. (These assumptions
are not necessarily safe when using real world images.
Camera registration and FOE recovery are problems that need to
be solved; a discussion appears in sec. 5.3.2.) This implies that
the image displacement paths for a static environmental feature
are constrained to lie on a straight line emanating radially
from the FOE. An interest operator is used to extract points
of high curvature and contrast in the image that are unlikely
to correlate well with false match points in future frames. It is
assumed that the obstacles or landmarks will exhibit some such
points on their boundaries. This is probable if the backdrop
is bland (e.g. the path itself) or if only "interesting landmarks"
are deliberately retrieved from LTM. However, it should be
expected that interest points will be extracted from both relevant
and nonrelevant image events.

The correspondence problem is the principal difficulty: how
can one be sure that the feature in one frame corresponds to
the same feature in the next frame after the robot has undergone
motion? The start-up phase involves finding initial correct
feature correspondences between the first two frames, while
the update phase involves the use of the start-up analysis and
the consequent approximate depth values to restrict the search
area for corresponding matches at a higher match resolution in
subsequent frames, thus reducing computation and providing
refinements of original depth estimates. Work by Snyder [22],
addressing the issue of uncertainty in this type of motion, is
fundamental to efficient use of previous correct correspondences
in constraining the match in future frames.

Assuming the camera is traveling at a known velocity, the pixel
displacements found between the first two images of a sequence
(start-up) can be used to further reduce the search for feature
matching in successive frames, once the images have been registered
so that non-translational motion of the camera has been
subtracted out. Known sensor motion leads to a constraint on
the match path, and approximate depth (from start-up) constrains
the portion of the path to be matched. Progressive refinements
can be made in the estimation of feature displacement
and hence distance to a relevant feature.
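
The geometric constraints above can be made concrete with standard pinhole geometry; this is a sketch under the stated assumptions of pure translation and a known FOE, not the authors' algorithm, and the function names are hypothetical. If a static point lies at radial image distance r1 from the FOE in one frame and r2 in the next, and the camera has advanced a distance T along the optical axis, then r2 / r1 = Z / (Z - T), where Z is the depth at the first frame. Solving for Z gives the depth, and bounds on Z give the portion of the radial path that needs to be searched.

```python
import math

def depth_from_radial_motion(p1, p2, foe, forward_distance):
    """Depth of a static point at the time of the first frame.

    p1, p2: image positions of the feature in consecutive frames;
    foe: focus of expansion; forward_distance: translation along the
    optical axis between the frames.
    """
    r1 = math.hypot(p1[0] - foe[0], p1[1] - foe[1])
    r2 = math.hypot(p2[0] - foe[0], p2[1] - foe[1])
    if r2 <= r1:
        raise ValueError("no outward displacement: depth is unobservable")
    return forward_distance * r2 / (r2 - r1)

def search_interval(p1, foe, forward_distance, z_min, z_max):
    """Radial distances from the FOE where the match may lie, given depth bounds.

    Requires z_min > forward_distance. Nearer points move farther, so the
    interval runs from the z_max prediction to the z_min prediction.
    """
    r1 = math.hypot(p1[0] - foe[0], p1[1] - foe[1])
    far = r1 * z_max / (z_max - forward_distance)
    near = r1 * z_min / (z_min - forward_distance)
    return far, near
```

As the depth estimate is refined, the bounds `z_min` and `z_max` tighten and the search interval along the radial path shrinks, which is how approximate depth from start-up reduces the matching cost in later frames.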

Different strategies, such as histogramming the collection of
points on the basis of depth, determining orientation of surfaces
based upon the depth of several points on associated regions, or
identifying landmarks by correlating distances from the viewer
with the objects in the environmental map in LTM, can be used
to extract objects from the environment. This data can then be
used to provide information to the motor schema manager for
effecting evasive action in the case of obstacles or for use in
localization in the case of landmark location. The specific application
of this technique to both of these cases appears below.


Figure 12. Depth from Motion Results
a-b) Two image sequence (distance traveled is [...]).
c) Interest points tracked (from image 1).
The steps are at a distance of [...] meters in [...]. The results
for the interest points associated with the steps are within 10 percent
of the ground truth.
(from [28]) (image sequence from [...])

5.3.2 Applications

A primary goal of the depth-from-motion algorithm is to be
able to provide information about the distance of an object lying
in the path of the robot. In obstacle avoidance applications,
computational requirements can be made tractable by restricting
the processing to interest points (i.e. trackable image points of
high contrast and curvature) and only to those that are lying
within the current path of the robot.

Figure 12 illustrates some preliminary results of using the
depth-from-motion algorithm for obstacle avoidance. It can be
seen that the results (accuracy within 10% in the recovery of
obstacle depth) indeed look promising, although continued research
effort will be made to validate the initial results. The
biggest problems encountered in the use of this algorithm in mobile
robotics include first, accurate recovery of the FOE, errors in which
can be minimized through accurate calibration, and second, ensuring
registration of the images. Stabilizing the camera with a
gyroscopic platform affords a hardware solution to the registration
problem. A software solution can be achieved by applying
the correlation matching process to points near and above the
horizon, i.e. distant (hence relatively unmoving) features that
can be registered from frame to frame. The algorithm is quite
sensitive to rotation, so every effort should be made to minimize
or eliminate any pitch, roll, or yaw movements of the camera
relative to the scene.

The motion algorithm can be used for landmark identification
as well. This is actually a simpler task than obstacle avoidance
in many respects, due to the availability of LTM knowledge
to guide processing in a top-down manner. The approximate
distance of a known landmark from the vehicle, within a restricted
portion of the overall image, substantially restricts the computation
required. When approximate ranges for the distance to an obstacle
are known, the algorithm will perform more robustly than when
underconstrained. Portions of the image can be searched that
are outside of the obstacle avoidance regions. As these are usually
further from the FOE than points in the robot's direction of
motion, greater pixel displacements will occur, and hence better
results in the depth analysis. There is also a bottom-up strategy:
obtain the depth of a subset of points that are on an anticipated
object and the expected surface orientation from STM, and
search the depth points returned for a match. By overlapping
region and line information, signs, buildings, telephone poles,
etc., could be located within the robot's frame of reference.
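
The remark that points further from the FOE yield better depth
estimates can be illustrated with the textbook relation for a camera
translating along its optical axis: if a static point lies at image
distance r1 from the FOE before the move and r2 after it, and the
camera travels a distance T, then the depth after the move is
Z = T * r1 / (r2 - r1). The sketch below is only an illustration
under that pure-translation assumption; it is not the multiple-frame
algorithm of section 5.3, and the function names and numbers are
hypothetical.

```python
import math


def depth_from_foe(p1, p2, foe, travel):
    """Depth along the optical axis of a static point, after the move.

    p1, p2 : (x, y) image positions of the same point in frames 1 and 2
    foe    : (x, y) focus of expansion in the image
    travel : distance the camera moved toward the scene between frames
    Assumes pure translation along the optical axis (no pitch, roll, yaw).
    """
    r1 = math.hypot(p1[0] - foe[0], p1[1] - foe[1])
    r2 = math.hypot(p2[0] - foe[0], p2[1] - foe[1])
    if r2 <= r1:
        raise ValueError("point must move away from the FOE")
    return travel * r1 / (r2 - r1)


def depth_error_for_pixel_noise(r1, r2, travel, noise=1.0):
    """Change in the depth estimate when r2 is off by `noise` pixels."""
    exact = travel * r1 / (r2 - r1)
    perturbed = travel * r1 / (r2 + noise - r1)
    return abs(exact - perturbed)


# Two points at the same true depth: one close to the FOE, one far from it.
near_foe_error = depth_error_for_pixel_noise(2.0, 2.5, travel=2.0)
far_from_foe_error = depth_error_for_pixel_noise(10.0, 12.5, travel=2.0)
```

For the same one-pixel tracking error, the point further from the FOE
has the larger displacement and therefore the smaller depth error,
which is the effect described above.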


5.4 Interest Operators

Interest operators are used in computer vision to pick out
pixels associated with regions of high curvature and contrast.
The Moravec operator [25] and the Kitchen-Rosenfeld gray-level
corner detection interest operator [26] are two well-known
examples. The depth-from-motion algorithm, described in section
5.3, uses an interest operator (currently Moravec's) to determine
the points on which to run the correspondence algorithms
from frame to frame.
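
For readers who want a concrete picture, the following is a minimal
sketch of one common formulation of a Moravec-style interest measure:
for each pixel, a small window is compared with copies of itself
shifted in four directions, and the interest value is the minimum
sum of squared differences. The window size, the set of shifts, and
the function name are assumptions made for illustration; this is not
claimed to be the exact variant used in AuRA.

```python
def moravec(image, half=1):
    """Moravec-style interest map for a 2-D list of gray levels.

    A pixel scores high only if its window changes under every shift,
    so straight edges and flat regions score zero and corners score high.
    """
    rows, cols = len(image), len(image[0])
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]
    margin = half + 1  # keep shifted windows inside the image
    out = [[0] * cols for _ in range(rows)]
    for r in range(margin, rows - margin):
        for c in range(margin, cols - margin):
            scores = []
            for dr, dc in shifts:
                total = 0
                for i in range(-half, half + 1):
                    for j in range(-half, half + 1):
                        diff = image[r + i + dr][c + j + dc] - image[r + i][c + j]
                        total += diff * diff
                scores.append(total)
            out[r][c] = min(scores)
    return out
```

Interest points would then be taken as local maxima of the returned
map above some threshold.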

Interest operators are quite primitive as a stand-alone method
for obtaining information for navigation. Their primary
advantage is speed. By combining knowledge available from
long-term memory with image data, it becomes possible to use interest
operators to confirm the position of landmark corners. A clear-cut
example would be the position of a building corner against
the sky. When combined with knowledge from the robot's positional
error map and available data from LTM, this method can
be used in restricted circumstances to confirm the position of a
real corner as predicted by the line-finding intersection method
(see 5.1.3). A succinct description of the Moravec operator
appears in [27] for those readers unfamiliar with its operation.


As the interest operator provides a measure of distinctiveness
(how different the pixel region is from its surroundings), the
Moravec operator can also be used as a trigger event for spawning
avoid-obstacle schema instantiations. When distinctive
events occur against the relatively unchanging road backdrop,
this would indicate a potential obstacle. This low-cost focus of
attention mechanism permits the concentration of higher-cost
computational effort in such likely situations.

6.  Summary 

AuRA is a mobile robot system architecture that provides
the flexibility and extensibility that is needed for an experimental
testbed for robot navigation. By allowing for the incorporation
of a priori knowledge in long-term memory, a variety of
different perceptual strategies can be brought to bear by the
robot in achieving its navigational goals. In particular, the
individual motor schemas and their associated perceptual schemas
can be added to or deleted from the overall system without forcing
a redesign.

A hierarchical planner determines the initial route as a
sequence of legs to be completed over known terrain with
predicted natural landmarks. Typical objects encountered in
extended man-made domains (the interior of buildings, and
outdoor settings with buildings and/or paths present) provide the
information necessary for localization. The information gleaned
from LTM is used to guide the pilot in the selection and
parameterization of appropriate motor schemas and their associated
perceptual schemas for instantiation in the motor schema
manager. Actual piloting (sensor-driven navigation) is conducted by
the motor schema manager. Positional updating occurs
concurrently with the actual path traversal.

The vision algorithms to be used in AuRA encompass a wide
range of computer vision techniques. These include primitive
interest operators, more sophisticated line-finding and region
segmentation algorithms, a multiple frame depth-from-motion
algorithm, and a scene interpretation system. Each approach
has its purpose, advantages and disadvantages for use in mobile
robot navigation. In all cases, however, versions of vision
algorithms have been developed which will extract image features
rapidly (at the expense of reliability in some cases). In
addition, the control of all algorithms attempts to use a top-down
strategy of restricting processing to windows based upon LTM
or previous frames. In this manner, real-time processing may be
achieved for certain interesting navigation tasks that might not
have been feasible until more powerful parallel hardware arrives.

Ongoing and future work includes the refinement of the
individual perception algorithms used, increased real-time mobile
robot experiments as parallel hardware is integrated into the
system, and addressing the considerable system integration
problems involved with the interfacing of the individual components
being developed.

Acknowledgments 

The authors would like to thank the members of the Laboratory
for Perceptual Robotics and VISIONS groups for their
assistance in the development of software and preparation of this
paper. Special thanks are due to Raj Bharwani, Phil Kahn and
Les Kitchen.

References 

1. Parma, C., Hanson, A. and Riseman, E., "Experiments in Schema-driven
Interpretation of a Natural Scene", COINS Tech. Rep. 80-10,
Comp. and Info. Sci. Dept., Univ. of Massachusetts, 1980.

2. Weymouth, T., "Using Object Descriptions in a Schema Network
for Machine Vision", Ph.D. Dissertation, COINS Tech. Rep. 86-24,
Univ. of Massachusetts, Amherst, May 1986.




3. Shen, H. and Signarowski, G., "A Knowledge Representation for
Roving Robots", IEEE Second Conf. on Artif. Intel. App., pp. 621-628,
December 1985.

4. Waxman, A.M., Le Moigne, J., and Srinivasan, B., "Visual Navigation
of Roadways", Proc. IEEE Int. Conf. Robotics and Automation,
St. Louis, Mo., pp. 862-867, 1985.

5. Arkin, R., "Internal Control of a Robot: An Endocrine Analogy",
unpublished paper, Dec. 1984.

6. Arkin, R., "Path Planning for a Vision-based Mobile Robot", to appear
in MOBILE ROBOTS - SPIE Proc. Vol. 727, 1986. Also COINS
Technical Report 86-48, Dept. of Computer and Information Science,
Univ. of Mass., 1986.

7. Arkin, R., "Motor Schema Based Navigation for a Mobile Robot:
An Approach to Programming by Behavior", to appear in Proc. IEEE
International Conference on Robotics and Automation, Raleigh, NC,
1987.

8. Arkin, R.C., Ph.D. Proposal, Computer and Information Science
Department, University of Massachusetts, Spring 1986.

9. Crowley, J., "Navigation for an Intelligent Mobile Robot", CMU
Robotics Institute Tech. Rep. CMU-RI-TR-84-18, 1984.

10. Chatila, R. and Laumond, J.P., "Position Referencing and Consistent
World Modeling for Mobile Robots", Proc. IEEE Int. Conf. Rob.
and Auto., St. Louis, Mo., pp. 138-145, 1985.

11. Giralt, G., "Mobile Robots", Artificial Intelligence and Robotics,
ed. Brady, Gerhardt, and Davidson, Springer-Verlag, NATO ASI Series,
pp. 365-394, 1984.

12. Giralt, G., Chatila, R., and Vaisset, M., "An Integrated Navigation
and Motion Control System for Autonomous Multisensory Mobile
Robots", Robotics Research: The First International Symposium, MIT
Press, pp. 191-214, 1984.

13. Krogh, B., "A Generalized Potential Field Approach to Obstacle
Avoidance Control", SME-RI Technical Paper MS84-484, 1984.

14. Krogh, B. and Thorpe, C., "Integrated Path Planning and Dynamic
Steering Control for Autonomous Vehicles", IEEE Conf. on
Robotics and Auto., 1986, pp. 1664-1669.

15. Burns, J., Hanson, A., Riseman, E., "Extracting Straight Lines",
Proc. IEEE 7th Int. Conf. on Pattern Recognition, Montreal, Canada,
pp. 482-485, 1984.

16. Thorpe, C. and Kanade, T., "Vision and Navigation for the CMU
Navlab", to appear in MOBILE ROBOTS - SPIE Proc. Vol. 727, 1986.

17. Tou, J., "Software Architecture of Machine Vision for Roving
Robots", Optical Engineering, 25-3, pp. 428-435, March 1986.

18. Williams, T., "Computer Interpretation of a Dynamic Image from
a Moving Vehicle", COINS Technical Report 81-22, Dept. of Computer
and Information Science, Univ. of Mass., May 1981.

19. Lawton, D., "Processing Dynamic Image Sequences from a Moving
Sensor", Ph.D. dissertation, COINS Technical Report 84-05, Dept. of
Computer and Information Science, Univ. of Mass., 1984.


20. Anandan, P., "Measuring Visual Motion from Image Sequences",
Ph.D. Dissertation, University of Massachusetts at Amherst, Jan. 1987.

21. Bharwani, S., Riseman, E. and Hanson, A., "Refinement of Environmental
Depth Maps over Multiple Frames", Proc. IEEE Workshop
on Motion Representation and Analysis, pp. 73-80, May 1986.

22. Snyder, M., "The Accuracy of 3D Parameters in Correspondence-based
Techniques", COINS Technical Report 86-28, Dept. of Computer
and Information Science, Univ. of Mass., July 1986.

23. Hanson, A. and Riseman, E., "VISIONS, A Computer System for
Interpreting Scenes", COMPUTER VISION SYSTEMS (Hanson and
Riseman eds.), Academic Press, pp. 303-333, 1978.

24. Hanson, A. and Riseman, E., "A Summary of Image Understanding
Research at the University of Massachusetts", COINS Tech. Report
83-35, Dept. of Computer and Information Science, Univ. of Mass.,
1983.

25. Moravec, H., Robot Rover Visual Navigation, UMI Press, 1981.

26. Kitchen, L. and Rosenfeld, A., "Grey-level Corner Detection",
Technical Report 887, Comp. Sci. Ctr., U. Maryland, College Park, Md.,
1980.

27. Ballard, D. and Brown, C., COMPUTER VISION, Prentice-Hall,
1982, pp. 69-71.

28. Bharwani, S., Riseman, E. and Hanson, A., "Multiframe Computation
of Accurate Depth Maps using Uncertainty Analysis", COINS
Tech. Report (in preparation), Dept. of Computer and Information
Science, Univ. of Mass., 1986.

29. Hanson, A. and Riseman, E., "The VISIONS Image Understanding
System - 1986", in ADVANCES IN COMPUTER VISION (Chris
Brown ed.), to be published by Erlbaum Press.

30. Weems, C., "Image Processing on a Content Addressable Array
Parallel Processor", Ph.D. Dissertation and COINS Tech. Report
84-14, Dept. of Computer and Information Science, Univ. of Mass.,
Sept. 1984.

31. Draper, B., Hanson, A. and Riseman, E., "A Software Environment
for High Level Vision", COINS Tech. Report (in preparation),
Dept. of Computer and Information Science, Univ. of Mass., 1986.

32. Parodi, A.M., "Multi-Goal Real-time Global Path Planning for an
Autonomous Land Vehicle using a High-speed Graph Search Processor",
Proc. IEEE Int. Conf. Robotics and Automation, St. Louis,
Mo., pp. 161-167, 1985.

33. Arbib, M., "Perceptual Structures and Distributed Motor Control",
HANDBOOK OF PHYSIOLOGY - The Nervous System II (V.
Brooks ed.), pp. 1449-1465, 1981.

34. Kohler, R., "Integrating Non-semantic Knowledge into Image Segmentation
Processes", Ph.D. Thesis, COINS TR-84-04, Dept. of Computer
and Information Science, Univ. of Mass., 1984.




Guiding  an  Autonomous  Land  Vehicle 
Using  Knowledge-Based  Landmark  Recognition 


Hatem  Nasr,  Bir  Bhanu  and  Stephanie  Schaffer 

Honeywell  Systems  and  Research  Center 
3660  Technology  Drive,  Minneapolis,  MN  55418 


ABSTRACT 


In  the  Autonomous  Land  Vehicle  (ALV)  application  scenario,  a 
significant  amount  of  positional  error  is  accumulated  in  the  land 
navigation  system  after  traversing  long  distances.  Landmark 
recognition  can  be  used  to  update  the  land  navigation  system  by 
recognizing  the  observed  objects  in  the  scene  and  associating 
them  with  the  specific  landmarks  in  the  geographic  map 
knowledge-base.  In  this  paper  we  present  a  novel  landmark 
recognition  technique  based  on  a  perception-reasoning-action 
and  expectation  paradigm  of  an  intelligent  agent.  It  uses 
extensive  map  and  domain  dependent  knowledge  in  a  model- 
based  approach.  It  performs  spatial  reasoning  by  using  N-ary 
relations  in  combination  with  negative  and  positive  evidences. 
Since it can predict the appearance and disappearance of
objects,  it  reduces  the  computational  complexity  and  uncertainty 
in  labeling  objects.  It  provides  a  flexible  and  modular 
computational  framework  for  abstracting  image  information  and 
modeling  objects  in  heterogeneous  representations.  We  present 
examples  using  real  ALV  images. 


I.  INTRODUCTION 


In  order  to  accomplish  missions  such  as  surveillance,  search  and 
rescue  and  munitions  deployment,  an  Autonomous  Land  Vehicle 
(ALV)  has  to  travel  long  distances.  This  results  in  a  significant 
amount  of  positional  error  in  the  land  navigation  system. 
Landmark  recognition  is  used  to  update  the  land  navigation 
system,  thus  guiding  the  ALV  to  remain  on  its  proper  course. 
Landmarks  of  interest  include  telephone  poles,  storage  tanks, 
buildings,  houses,  gates,  etc. 

Model-based  vision  has  been  a  popular  paradigm  in  computer 
vision  since  it  reduces  the  problem  complexity  and  no  learning  is 
involved.  Binford  [6]  has  given  a  summary  of  model-based 
vision  work.  He  has  described  several  systems  including  the 
work  of  Brooks  [7]  on  ACRONYM,  Riseman  and  Hanson's 
[12]  work  on  VISIONS,  and  Nagao  and  Matsuyama's  [14] 
work  on  the  analysis  of  complex  aerial  photographs.  McKeown 
et al. [13] have used map and domain specific knowledge in the
SPAM rule-based system for the interpretation of airport scenes
in aerial images. Hwang [10] has also used domain knowledge
to guide interpretation of suburban house scenes in aerial
imagery. He has used a test-hypothesize-act sequence to generate
a large number of hypotheses, which are then integrated into a
consistent interpretation. Bhanu [1-4] has used several
modeling and relaxation matching techniques for the recognition
of 2-D and 3-D nonoccluded and occluded objects. In contrast
to all of the previous related work mentioned above, the
paradigm of an intelligent agent (like the ALV) which
we have used here is based on the perception-reasoning-action
and expectation cycle. Thus we have an expectation-driven,
knowledge-based landmark recognition system called
PREACTE (Perception-REasoning-ACTion and Expectation),
that utilizes a priori, map and perceptual knowledge, spatial
reasoning and knowledge aggregation methods. In contrast to
the  work  of  Davis  [9],  explicit  knowledge  about  the  map  and 
landmarks  is  assumed  to  be  given  and  it  is  represented  in  a 
relational  network.  It  is  used  to  generate  an  Expected  Site  Model 
(ESM)  given  the  ALV  location  and  its  velocity.  Landmarks  at  a 
particular  map  site  have  their  3-D  models  stored  in 
heterogeneous  representations.  The  vision  system  generates  a  2- 
D  and  partial  3-D  scene  model  from  the  observed  scene.  The 
ESM  hypothesis  is  verified  by  matching  it  to  the  image  model. 
The  matching  problem  is  solved  by  using  object  grouping  and 
spatial  reasoning.  Positive  as  well  as  negative  evidences  are 
used  to  verify  the  existence  of  each  landmark  in  the  scene.  The 
system also provides feedback control to the low-level processes
to permit adaptation of the feature detection algorithms'
parameters to changing illumination and environmental
conditions.

In  the  following,  we  present  the  details  of  the  PREACTE  system 
and  examples  of  landmark  recognition  using  real  ALV  imagery. 
It  is  worth  mentioning  that  PREACTE  is  a  component 
subsystem  of  a  much  larger  system.  Other  parts  of  the  system 
will  be  referred  to  but  not  described. 


II.  CONCEPTUAL  APPROACH 


The  task  of  visual  landmark  recognition  in  the  autonomous 
vehicle  scenario  can  be  categorized  as  (a)  uninformed  and  (b) 
informed.  In  the  uninformed  case,  given  a  map  representation, 
the  vision  system  attempts  to  attach  specific  landmark  labels  to 
segmented  image  regions  of  an  arbitrary  observed  scene  and 
infers  the  location  of  the  vehicle  in  the  map  (world).  On  the 
other hand, in the informed case, while the task is the same as
before, there is a priori knowledge (with a certain level of
certainty) of the past location of the vehicle in the map and its
velocity. It is the informed case that is of interest to the
discussion of this paper. There are a number of assumptions
made in this landmark recognition approach. They include: 1)
a forward looking fixed camera model is given, 2) traversal by
the vehicle is allowed only on defined routes, and 3) minor
range variations from a given site do not lead to major changes
in the objects' appearance in the image and their spatial
distributions.




Fig. 1. PREACTE's top-level approach to landmark recognition.


Fig.  1  illustrates  the  overall  approach  for  PREACTE's  landmark 
recognition  task.  It  is  a  top-down  expectation-driven  approach, 
whereby  an  Expected  Site  Model  (ESM)  of  the  map  is  generated 
based  on  domain-dependent  knowledge  of  the  current  (or 
projected)  location  of  the  vehicle  in  the  map  and  vehicle's 
velocity.  The  ESM  contains  models  of  the  expected  map  site 
and  its  landmarks.  This  expectation  provides  the  hypotheses  to 
be  verified  by  the  content  of  an  image  to  be  acquired  after  a 
computed  time  t,  given  the  velocity  of  the  vehicle  and  the 
distance between the current site and the predicted one. While
this approach may seem similar to other hypothesis-verification
concepts, it is unique not only in its added expectations but also
in its extensive and explicit domain specific knowledge, which
contributes to enhanced performance. Site models introduce
spatial  constraints  on  the  locations  and  distributions  of 
landmarks,  by  using  a  "road"  model  as  a  reference.  Spatial 
constraints  greatly  reduce  the  search  space  while  attempting  to 
find  a  correspondence  between  the  image  regions  and  a  model. 
This  mapping  is  usually  many-to-one  in  complex  outdoor 
scenes,  because  of  imperfect  segmentation. 

In the segmented image each region-based feature such as size,
texture, color, etc. provides independent evidence for the
existence of an expected landmark. Evidence accrual is
accomplished by an extension of a heuristic Bayesian formula
[8], which will be discussed in Section II.3. The heuristic
formula is used to compute the certainty about a map site
location based on the certainty of the previous site and the
evidence for the existence of each landmark at the current site.
A similar formulation was suggested by Lowe [11] for evidential
reasoning for visual recognition.
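
The paper's own accrual formula is not reproduced in this section.
As a rough illustration of how independent pieces of positive and
negative evidence can be combined, the sketch below uses the generic
odds-likelihood form of Bayes' rule. It is an assumption made for
illustration, not the heuristic formula of [8] or its extension, and
the function name and ratio values are hypothetical.

```python
def accrue(prior, likelihood_ratios):
    """Posterior probability after combining independent evidence.

    prior             : probability of the hypothesis before any evidence,
                        strictly between 0 and 1
    likelihood_ratios : one ratio per feature; a ratio above 1 is positive
                        evidence, a ratio below 1 is negative evidence
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


# Shape and color support the landmark hypothesis, size counts against it.
belief = accrue(0.5, [4.0, 3.0, 0.5])
```

Treating each feature as independent is what allows the ratios to be
multiplied; correlated features would overstate the certainty.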


A.  Map/Landmark  Knowledge-Base 

Extensive map knowledge and landmark models are
fundamental to the recognition task. Our map representation
relies heavily on declarative and explicit knowledge instead of
procedural methods on relational databases [13]. The map
knowledge is represented in a hierarchical relational network, as
illustrated in Fig. 2. The entire map is divided into 25 sectors (5
horizontally and 5 vertically). Each sector contains four
quadrants which in turn contain a number of surveyed sites (Fig.
3). All map primitives are represented in a schema structure.
The map dimensions are characterized by their cartographic
coordinates. Schema representation provides an object-oriented
computational environment which supports the inheritance of
different map primitives' properties and allows modular and
flexible means for searching and updating the map knowledge
base. The map sites between which the vehicle traverses have
been surveyed and characterized by site numbers. An aerial
photograph with numbered sites is shown in Fig. 3.
Knowledge acquired about these sites includes: approximate
latitude, longitude, elevation, distance between sites, terrain
descriptions, landmark labels contained in a site, etc. Such site
information is represented in a SITE schema, with
corresponding slots, as illustrated in Fig. 2. Slot names
include: HAS_LANDMARKS, NEXT_SITE, LOCATION,
SPATIAL_MODEL, etc. A critical slot is NEXT_SITE, which
has an "active" value. By active, it is meant that it is dependent
on a variable (demon) which is the vehicle direction (North,
South, etc.). For different traversal directions from the current
site, the names of the neighboring sites are explicitly declared in
the NEXT_SITE slot, as shown in Fig. 4. The
SPATIAL_MODEL defines the "expectation zone" of the
landmarks (in the image) with respect to the road and with
respect to each other. It also specifies the minimum and
maximum distance of each landmark from the road borderline.
Each landmark is represented as a schema or a collection of
schemas. Each landmark is represented as an instance of a
landmark-class which, in turn, is an instance of an object-class.
For example, T-POLE-17 is an instance of POLE, which is an
instance of MAN_MADE_OBJECTS. Instances in this case
inherit some properties and declare others.
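
The instance-of chain described above can be sketched in a few lines.
This is a hypothetical illustration of slot inheritance, not the
schema system used in PREACTE; the Schema class and the slot values
are assumptions.

```python
class Schema:
    """A named frame whose missing slots are inherited from its parent."""

    def __init__(self, name, instance_of=None, **slots):
        self.name = name
        self.instance_of = instance_of
        self.slots = dict(slots)

    def get(self, slot):
        # Walk up the instance-of chain until some schema declares the slot.
        schema = self
        while schema is not None:
            if slot in schema.slots:
                return schema.slots[slot]
            schema = schema.instance_of
        raise KeyError(slot)


# Illustrative slot values only.
man_made = Schema("MAN_MADE_OBJECTS", texture="smooth")
pole = Schema("POLE", instance_of=man_made, shape="elongated")
t_pole_17 = Schema("T-POLE-17", instance_of=pole, color="black")
```

Here T-POLE-17 declares its own color and inherits shape from POLE
and texture from MAN_MADE_OBJECTS.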

This declarative and hierarchical representation of the knowledge
allows not only a natural conceptual mapping, but also a flexible
means for pattern matching, data access, tracing of the reasoning
process and maintenance of the knowledge base. The slots and
their values in a LANDMARK schema correspond to the
landmark's attributes such as color, texture, shape, geometric
model, etc. The landmark attributes are characterized
symbolically, such as color is "black", texture is "smooth", and
shape is "elongated". Each attribute's value is assigned a
likelihood that characterizes its discriminant strength. For
example, the fact that poles are elongated places a high likelihood
value (0.8) on having an elongated shape. The knowledge
acquisition for modeling each landmark in the knowledge base is
performed by abstracting and characterizing map data through
actual observations and measurements, and observations of
images taken at individual sites. The ground-truth values of each
landmark attribute are obtained from a combination of actual
metrics, and approximations of features extracted from hand
segmented images. Three-dimensional geometric models are
represented in the geometric-model slot of a LANDMARK
schema. Different modeling techniques are available for
different landmarks. For example, buildings are represented as
wire frames, while poles are represented as generalized
cylinders. Thus models are allowed to have heterogeneous
representations. Image description is obtained by projecting the
3-D model on a 2-D plane using perspective transformations.
This hybrid representational framework for object modeling
provides a unique ability to combine different types of object
descriptions (semantic, geometric and relational). This in turn
allows the system to perform more robustly and efficiently, and
recover from a single bad representation.

Fig. 2. Map knowledge representation and graphic illustration
of the approach based on the perception-reasoning-action
and expectation paradigm.
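
The projection of a 3-D model onto a 2-D plane can be sketched with a
simple pinhole camera model. This is a minimal illustration under the
assumption that the model vertices are already expressed in camera
coordinates with the z axis along the line of sight; the function
names and the focal length are hypothetical.

```python
def project(point, focal_length=1.0):
    """Perspective projection of a 3-D point (x, y, z) onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_length * x / z, focal_length * y / z)


def project_wireframe(vertices, edges, focal_length=1.0):
    """Project a wire-frame model; edges are pairs of vertex indices."""
    points = [project(v, focal_length) for v in vertices]
    return [(points[i], points[j]) for i, j in edges]
```

A building stored as a wire frame would be passed as its vertex list
and edge list; the result is the set of 2-D line segments expected in
the image.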

Fig. 3. Aerial photograph of the map.

B. Prediction

Given the a priori knowledge of the vehicle's current location in
the map space and its velocity, it is possible to predict the
upcoming site that will be traversed through the explicit
representation of the map knowledge and the proper control
procedures. The Map/Landmark Reasoner (MLR) provides
such control, by invoking the active values in the NEXT_SITE
slot of the current SITE schema, as described earlier. The ESM
is a "provision" by the MLR to make the expected site and its
corresponding landmark schemas an "active" hypothesis to be
verified. In parallel to predicting the next site, the distance
between the current and the expected site along with the vehicle
velocity are used to predict the (arrival) time at which the
sequence of images should be processed to verify the
hypothesized ESM. Evidence accrual and knowledge
aggregation are dynamically performed between sites to confirm
arrival time at the predicted site [5]. The ESM of SITE-110
shown in Fig. 7 is as follows:

(DEFSCHEMA SITE-110
  (SITE 110)
  (LOCATION (392961.7 1050742.9))
  (INSTANCE-OF SITE)
  (STATUS ACTIVE)
  (VIEW FRONT)
  (HAS-LANDMARKS (T-POLE-110 G-TANK-110 BLDG-110))
  (NEXT-SITE (E 109 0.255) (W 111 0.153))
  (SPATIAL-MODEL SM-110)
  (TERRAIN-TYPE NIL))

(SETQ SM-110 '((T-POLE-110 G-TANK-110 BLDG-110)
  (T-POLE-110 (LEFT-OF ROAD)
              (MIN-L-DIST-ROAD 160)
              (MAX-L-DIST-ROAD 200))
  (G-TANK-110 (RIGHT-OF ROAD)
              (MIN-R-DIST 200)
              (MAX-R-DIST 250))
  (BLDG-110 (ABOVE G-TANK-110)
            (RIGHT-OF ROAD)
            (MIN-R-DIST 230)
            (MAX-R-DIST 280))))




(DEFSCHEMA S1
  (Next-Site (E S3 D1)
             (N S2 D2)
             (S S0 D3))
  ...)

Fig.  4.  Next-Site  slot  representation  in  a  SITE  schema  provides 
the  expected  site  and  distance  to  it  based  on  the  vehicle 
direction. 
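
The prediction step can be sketched as a table lookup followed by a
division. The table below copies the NEXT-SITE entries of SITE-110
from the schema listed earlier; the units of the stored distances are
not stated in this section, so the speed is assumed to be in the same
distance units per unit time, and the function name is hypothetical.

```python
# NEXT-SITE entries keyed by site number, then by direction of travel.
# Each entry is (next site number, distance to it).
NEXT_SITE = {
    110: {"E": (109, 0.255), "W": (111, 0.153)},
}


def predict_next_site(site, direction, speed):
    """Return the expected next site and the predicted travel time to it."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    next_site, distance = NEXT_SITE[site][direction]
    return next_site, distance / speed


expected_site, arrival_time = predict_next_site(110, "E", speed=10.0)
```

The returned time is when the image sequence would be processed to
verify the hypothesized ESM for the expected site.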

Predictions are also used by the low-level image processing. A
priori knowledge of the attributes of the objects which will appear
in the image, and of their relative locations, guides the
segmentation. A rule-based system is invoked to interpret the
corresponding information in the ESM, which results in properly
adapting the segmentation parameters based on the landmarks'
distinguishing attributes, such as color, location, texture, etc.


C.  Image  Modeling 

An  image  model  for  the  task  of  landmark  recognition  is  a 
collection  of  regions-of-interest  extracted  by  a  region-based 
segmentation  method.  A  region-of-interest  for  this  task  is 
constrained  by  an  upper  and  lower  bound  on  its  size.  This 
means  that  after  performing  some  region  splitting  and  merging, 
most  of  the  very  small  and  the  very  large  regions  are  merged  or 
discarded  from  the  image.  In  addition,  regions-of-interest  do 
not  include  any  regions  that  represent  moving  objects  in  the 
scene  as  determined  by  the  motion  analysis  module.  A  number 
of image features are extracted for each region, such as color,
length, size, perimeter, texture, Minimum Bounding Rectangle
(MBR), etc., as well as some derived features such as
elongation, linearity, compactness, etc. All image information is
available  in  the  blackboard  (Fig.  1),  which  is  a  superset  model 
of  all  the  results  collected  from  different  image  understanding 
modules.  The  landmark  recognition  system  operates  as  a 
"knowledge  source".  There  are  other  knowledge  sources  with 
other  tasks  such  as  object  recognition  (other  than  landmarks)  and 
motion  analysis.  The  blackboard  plays  the  role  of  a  central 
knowledge  structure  among  these  different  knowledge  sources. 
During the process of extracting regions-of-interest for landmark
recognition there is a risk of ignoring regions in the image that
are part of the actual landmarks. These regions could have been
split into very small regions or merged with very large ones,
which is a natural outcome of the inherently weak segmentation
methods. Symbolic feature extraction is performed on the
region-based features. So, instead of having area = 1500
(pixels) and intensity = 52, we could have area = large and
intensity = low. The symbolic characterization of the features
using "relative" image information provides a better abstraction
of the image and a framework for knowledge-based reasoning.
On one hand, this has the advantage of making the feature space
smaller, and therefore easier to manipulate. On the other hand, it
makes the representation insensitive to feature variations in the image.
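The symbolic characterization described above can be sketched as follows. This is only an illustration: it assumes that "relative" means ranking a region's value against the same feature over all regions in the image, and the function name `symbolize`, the equal-rank bins, and the sample values are hypothetical rather than taken from the PREACTE implementation.

```python
def symbolize(value, all_values, labels=("low", "medium", "high")):
    """Map a numeric feature to a symbolic label relative to the image.

    Assumption: the label is chosen from the share of regions in the same
    image whose value is strictly smaller (equal-rank bins).
    """
    below = sum(1 for v in all_values if v < value)
    index = min(below * len(labels) // len(all_values), len(labels) - 1)
    return labels[index]


# Hypothetical region areas (pixels) and mean intensities from one image.
areas = [40, 120, 300, 650, 900, 1500]
intensities = [52, 110, 160, 200, 225, 240]

region = {
    "area": symbolize(1500, areas, labels=("small", "medium", "large")),
    "intensity": symbolize(52, intensities),
}
print(region)
```

With these sample values the region with area = 1500 and intensity = 52 is described as area = large and intensity = low, matching the example in the text.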

Each set of region features is represented in a schema structure
instead of a feature vector as in pattern recognition. This
schema representation of regions has no conceptual justification
of its own; however, it provides a data structure compatible
with the landmark models in the knowledge-base. Most of the
region features have representative attributes in the landmark
models. This allows symbolic pattern matching to be
performed easily by the high-level vision knowledge sources.
Beyond that, it makes the reasoning process more traceable.


Fig.  5.  Road  model  representation. 

A critical region in the image is the road region, which is used as
a  reference  in  the  image  model.  Spatial  constraints  are  applied 
on  the  regions-of-interest  to  find  which  regions  in  the  image  fall 
to  the  left  and  to  the  right  of  the  road.  The  road  is  easily 
segmented  out  in  similar  imaging  scenarios,  using  current  state- 
of-the-art  road  segmentation  techniques  (currently  used  in  the 
ALV).  This  is  assuming  that  it  is  a  "structured"  road  (i.e., 
asphalt,  concrete,  etc.)  that  provides  good  contrast  (not  dirt 
roads).  The  road  is  represented  in  the  model  by  its  vertices  and 
the  approximate  straight  lines  of  the  left  and  right  borders,  as 
shown  in  Fig.  5.  For  each  region,  we  determine  the  position  of 
its  centroid  and  compute  the  shortest  distance  from  the  region  to 
the  road  border  line.  This  distance  is  compared  to  the  constraint 
imposed  on  each  landmark  by  the  site  spatial  model.  Thus  we 
obtain  the  following  top-level  structure  for  the  image  model: 

(<IM-#> <frame-#> <road-region-tag>
 <number-of-regions-of-interest>
 ((<road-vertices>) (<left-border>) (<right-border>))
 ((<left-regions-list>) (<right-regions-list>)))
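The distance test described above can be sketched as follows. The border line and the centroid are the values reported for SITE-110 in the Results section; the function names and the 200-pixel limit are hypothetical, and the sketch assumes each border is stored in slope-intercept form.

```python
import math


def distance_to_border(centroid, slope, intercept):
    """Perpendicular distance from a point to the line y = slope * x + intercept."""
    x0, y0 = centroid
    return abs(slope * x0 - y0 + intercept) / math.hypot(slope, 1.0)


def within_constraint(centroid, border, max_distance):
    """True when the centroid is no farther than max_distance from the border."""
    slope, intercept = border
    return distance_to_border(centroid, slope, intercept) <= max_distance


left_border = (-0.8, 357.0)    # y = -0.8x + 357.0 (from the Results section)
pole_centroid = (101.2, 82.9)  # centroid of region 30 (Table I)

d = distance_to_border(pole_centroid, *left_border)
print(round(d, 1), within_constraint(pole_centroid, left_border, max_distance=200.0))
```

In the system itself the limit would come from the spatial model of the expected site rather than from a constant.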


D.  Hypothesis  Verification 

Given the expected site model (ESM) and the current image
model, the objectives of the matching and verification process
are twofold: (a) to label the regions in the image corresponding
to the expected landmarks, and (b) to determine the certainty
level of the predicted map site location. The first objective is
accomplished as follows:

1. Find the set of regions {R} in the image model (IM) that satisfy
the spatial constraints SC_i imposed by landmark l_i in the ESM
SPATIAL_MODEL. Applying this constraint typically yields more
than one corresponding region r_j.

2. Compute the evidence E(l_i) that each r_j in {R} yields, using the
FIND_EVIDENCE algorithm.

3. The r_j that results in E(l_i)max (provided it is a positive
evidence) is considered the best match candidate for l_i (there may
be more than one, provided that their values surpass a certain
threshold).

The second objective is achieved by aggregating the individual set
of evidences (E(l_i)max), the certainty level about the previous
map site location, and the potential error introduced by the range
and the view angle of the camera.
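The three labeling steps can be sketched as a small selection routine. This is a hypothetical outline, not the PREACTE code: the spatial test and the evidence function are passed in as callables, the 0.6 "positive" limit and the 0.8 acceptance threshold are the values quoted elsewhere in the paper, and the region records are illustrative.

```python
POSITIVE = 0.6  # evidence above this value counts as positive


def best_matches(regions, satisfies_constraints, evidence, threshold=0.8):
    """Return (evidence, region) pairs that are best match candidates.

    Step 1: keep regions that pass the spatial constraints.
    Step 2: score every remaining region with the evidence function.
    Step 3: keep the maximum, or every region above `threshold`.
    """
    candidates = [r for r in regions if satisfies_constraints(r)]
    scored = [(evidence(r), r) for r in candidates]
    if not scored:
        return []
    best = max(score for score, _ in scored)
    if best <= POSITIVE:
        return []
    strong = [pair for pair in scored if pair[0] > threshold]
    chosen = strong if strong else [pair for pair in scored if pair[0] == best]
    return sorted(chosen, key=lambda pair: pair[0], reverse=True)


# Illustrative records; "side" stands in for the spatial constraint test.
regions = [
    {"id": 30, "e": 0.83, "side": "left"},
    {"id": 79, "e": 0.92, "side": "left"},
    {"id": 95, "e": 0.62, "side": "right"},
]
pole = best_matches(regions, lambda r: r["side"] == "left", lambda r: r["e"])
print([(score, r["id"]) for score, r in pole])
```

Both left-side regions exceed the threshold here, so both are returned, with the stronger one first.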

The FIND_EVIDENCE algorithm considers that each landmark
l_i in the ESM has a set of attributes {A_i1, ..., A_ik, ..., A_in},
each with a likelihood LH_ik, as described earlier. Each region r_j
in {R} has a set of features {f_j1, ..., f_jk, ..., f_jn}. Note that
A_ik and f_jk correspond to the same property (in the model and the
image), such as color, size, texture, etc. Given these features,
we want to compute the evidence that l_i is present in the image.


$$P(l_i / f_{j1}, \ldots, f_{jk}, \ldots, f_{jn}) = \frac{P(f_{j1}, \ldots, f_{jk}, \ldots, f_{jn} / l_i) \, P(l_i)}{P(f_{j1}, \ldots, f_{jk}, \ldots, f_{jn})} \qquad (1)$$

By making the independence assumption among features in a
region and among feature occurrences in images, the above
equation can be rewritten as:

$$P(l_i / f_{j1}, \ldots, f_{jk}, \ldots, f_{jn}) = \frac{P(f_{j1} / l_i) * \ldots * P(f_{jk} / l_i) * \ldots * P(f_{jn} / l_i) * P(l_i)}{P(f_{j1}) * \ldots * P(f_{jk}) * \ldots * P(f_{jn})} \qquad (2)$$

where n is the number of features and P(l_i) is the initial probability
of a landmark being found in a given site. For now this is set to
1 for all landmarks. However, P(l_i) is actually a function of the
certainty level about the previous map site location, navigational
error and other variables. P(f_jk) is the probability of occurrence
of a feature in an image, which is equal to 1/(number of possible
feature values). For example, if texture can take any of the
four values coarse, smooth, regular or irregular, then
P(texture = smooth) = 1/4. Finally,

$$P(f_{jk} / l_i) = \begin{cases} LH_{ik} & \text{if } f_{jk} = A_{ik} \\ \dfrac{1 - LH_{ik}}{d(f_{jk}, A_{ik})} & \text{if } f_{jk} \neq A_{ik} \end{cases} \qquad (2.1)$$

which is best explained through the following example:

Given two regions r_1 and r_2 in the image with different sizes
(f_jk), SIZE(r_1) = SMALL and SIZE(r_2) = LARGE. Given a
model of landmark L, with the expected size to be LARGE
(A_ik), with a likelihood (LH_ik) of 0.7. The SIZE feature can
take any of the following ordered values: {SMALL, MEDIUM,
LARGE}. If r_2 is being matched to L, (2.1) yields 0.7,
because f_jk = A_ik. On the other hand, if r_1 is being matched to
L, then (2.1) yields (1 - 0.7)/2. The denominator 2 is used
because LARGE is two unit distances (denoted by d(.)) from
SMALL. We rewrite (2) as:

$$P(l_i / f_{j1}, \ldots, f_{jk}, \ldots, f_{jn}) = P(l_i) * \prod_{k=1}^{n} \frac{P(f_{jk} / l_i)}{P(f_{jk})} = P(l_i) * \prod_{k=1}^{n} I(f_{jk} / l_i) \qquad (3)$$

where I(./.) is the term within the product sign. The value of
I(f_jk/l_i) can be greater than 1, because the heuristic nature of the
formulation does not reflect a probabilistic set of conditional
events, as formulated in Bayes theory. Moreover, P(l_i/f_j1...f_jn)
can result in a very large number or a very small positive
number.

By taking the logarithm of both sides of (3), introducing W_k as
a normalization factor for each feature, and dividing by the
number of features (n), we have:

$$\log[P(l_i / f_{j1}, \ldots, f_{jk}, \ldots, f_{jn})] = \log[P(l_i)] + \frac{1}{n} \sum_{k=1}^{n} \log[I(f_{jk} / l_i) * W_k] \qquad (4)$$

where W_k is a normalization factor between 0 and 1.

Here we further simplify (4) and introduce the evidence terms E
and e to be the logarithm of P and I*W, respectively. So, the
evidence formula can be written as follows:

$$E(l_i) = \log[P(l_i)] + \frac{1}{n} \sum_{k=1}^{n} e(f_{jk} / l_i)$$


The values of E(l_i) fall between 0 and 1. If E(l_i) > 0.6, it means
a "positive" set of evidences. On the other hand, if E(l_i) < 0.3, it
is interpreted as "negative" evidence. Otherwise, E(l_i) is
characterized as "neutral".
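The feature term (2.1), the ratio I, and the evidence thresholds can be sketched as below. This is a minimal illustration under stated assumptions: d(.) is taken to be the index distance in an ordered value list, the logarithm is taken as natural, and the evidence function assumes the form "log of the prior plus the mean of log(I * W)". The text does not say how the weights W_k are chosen so that E stays between 0 and 1, so they are left as inputs.

```python
import math

SIZES = ("SMALL", "MEDIUM", "LARGE")  # ordered values from the size example


def feature_likelihood(observed, expected, likelihood, ordered_values):
    """Term (2.1): P(f_jk / l_i) for one symbolic feature."""
    if observed == expected:
        return likelihood
    distance = abs(ordered_values.index(observed) - ordered_values.index(expected))
    return (1.0 - likelihood) / distance


def feature_ratio(observed, expected, likelihood, ordered_values):
    """I(f_jk / l_i) = P(f_jk / l_i) / P(f_jk), with P(f_jk) = 1 / (number of values)."""
    prior = 1.0 / len(ordered_values)
    return feature_likelihood(observed, expected, likelihood, ordered_values) / prior


def evidence(ratios, weights, landmark_prior=1.0):
    """Assumed form: log(prior) + mean over features of log(I * W)."""
    total = sum(math.log(i * w) for i, w in zip(ratios, weights))
    return math.log(landmark_prior) + total / len(ratios)


def characterize(e):
    """Label an evidence value with the thresholds quoted in the text."""
    if e > 0.6:
        return "positive"
    if e < 0.3:
        return "negative"
    return "neutral"


print(feature_likelihood("LARGE", "LARGE", 0.7, SIZES))
print(feature_likelihood("SMALL", "LARGE", 0.7, SIZES))
print(feature_ratio("LARGE", "LARGE", 0.7, SIZES))
```

For the size example, a LARGE region matched to a landmark expected to be LARGE gets the likelihood 0.7, while a SMALL region gets (1 - 0.7)/2; the ratio for the matching case exceeds 1, as the text notes it can.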

An important characteristic of the PREACTE system is that it
utilizes negative as well as positive evidences to verify its
expectations. There are many types of negative evidences that
could be encountered during the hypothesis generation and
verification process. The one that is of particular interest to us is
when there is negative evidence about a "single" landmark
(E(l_i)max < 0.3) in conjunction with positive evidences about the
other landmarks (average evidence > 0.6) and a reasonable level
of certainty about the previous site (U_s-1 < 5) (discussed later).
This case is interpreted as caused by one or more of the
following: (a) error in the dimension of the expectation zone,
(b) bad segmentation results, and (c) change in the expected
view angle or range.

In  such  a  case,  we  perform  the  following  steps:  1.  enlarge  the 
expectation  zone  by  a  fixed  margin,  and  find  the  evidences 
introduced  by  the  new  set  of  regions,  as  shown  in  Fig.  6.  2.  If 
step  1  fails  to  produce  an  admissible  set  of  evidences,  then  the 
expectation  zone  of  the  image  is  resegmented  using  a  new  set  of 
parameters  that  are  strictly  object  dependent. 


Fig. 6. New search area as a result of negative evidences
(labeled areas: Negative Evidence Area, New Search Area).


Even  though  landmark  recognition  is  introduced  to  assist  the 
autonomous  vehicle  land  navigation  system,  there  is  obviously 
uncertainty  attached  to  the  results  of  the  recognition  system.  We 
compute  the  uncertainty  Us  at  each  site  location  in  the  following 
manner 


where U_s-1 is the uncertainty at the previous site; U_0, the initial
uncertainty, is equal to 1; a is the error factor introduced by the
navigation system, set to a constant of 0.3 (for experimental
reasons); and E(l_i)max is the maximum evidence of l_i. If two or
more regions return evidences greater than 0.8, then their average is
computed. The value 0.5 is used (as neutral evidence) to stabilize
the function, and m is the number of landmarks. The multiplicative
nature of U_s provides it with the capability of rapidly recovering
its value given a high set of evidences at the current site and a
high level of uncertainty at the previous site.
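The update can be sketched as follows, with one important caveat: the form used here, U_s = (U_s-1 + a) * product over landmarks of (0.5 / E(l_i)max), is an assumption inferred from the description above and from the SITE-110 value of 0.43 reported in the Results section. In particular, whether the previous uncertainty and the error factor are added or combined in some other way cannot be told from that example, because U_0 = 1 there. The function names are hypothetical.

```python
def landmark_evidence(evidences, high=0.8):
    """E(l_i)max for one landmark; averaged when two or more exceed `high`."""
    strong = [e for e in evidences if e > high]
    if len(strong) >= 2:
        return sum(strong) / len(strong)
    return max(evidences)


def update_uncertainty(previous, per_landmark, error_factor=0.3, neutral=0.5):
    """Assumed form: (U_prev + a) * product over landmarks of (neutral / E_max)."""
    uncertainty = previous + error_factor
    for evidences in per_landmark:
        uncertainty *= neutral / landmark_evidence(evidences)
    return uncertainty


# Evidences quoted for SITE-110: pole (two regions), tank, building.
u_110 = update_uncertainty(1.0, [[0.92, 0.83], [0.62], [0.70]])
print(round(u_110, 2))
```

Evidences above the neutral value 0.5 shrink the uncertainty and evidences below it enlarge it, which is the stabilizing role the text assigns to that constant.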


III. RESULTS

We have implemented a prototype system written in Common
Lisp and ART (Automated Reasoning Tool) on the Symbolics
3670. The image processing software was implemented in C on
the VAX 11/750. The Symbolics hosts all the high-level
(symbolic) processing software, including the blackboard. The
map and landmarks knowledge-base is implemented as a
hierarchical relational network of schemas, and the FIND_EVIDENCE
algorithm is implemented in Lisp. The Map-Landmark Reasoner
is implemented in a rule-based structure.
An initial implementation of PREACTE was tested on a video
sequence of imagery. Data was collected at 30 frames/second by
a camera mounted on top of a vehicle driven along the road
connecting the sites shown in Fig. 3.

The system was able to predict the next site and the
approximate arrival time at each site. In an experiment where we
started at SITE-109 and traveled west at 10 kph, the arrival time
at SITE-110 was predicted to be 245.4 seconds (the distance between
the two sites is 0.426 mile). Fig. 7(a) shows the image taken at
SITE-110, which contains a pole (T-POLE-110) to the left of the
road, a gas tank (G-TANK-110) and a building (BLDG-110) to
the right. Initial segmentation of the image is shown in Fig.
7(b). As a result of region splitting and merging and discarding
small regions, we obtain the image shown in Fig. 7(c).
Rectangular boxes are overlaid on the regions recognized by
PREACTE as landmarks, based on the hypotheses generated in
the ESM of SITE-110 (shown in Section II.2). The hypothesis
verification results are shown in the "match-evidence" column of
Table I, which contains a listing of the regions yielding the
highest evidences and a subset of their features (other features
include compactness, intensity variance, etc.), as represented in
the image model. The spatial constraint specified by the spatial
model (SM-110) yielded a small number of regions-of-interest
for the POLE hypothesis, as a result of the successful
post-segmentation effort. More regions-of-interest were considered
as candidates for the other landmarks. The road (region 98) in
the image is modeled by its approximate left border (y = -0.8x +
357.0) and its right border (y = 1.3x - 126.7). The T-POLE-110
hypothesis produced two regions with high evidences.
Since a threshold of 0.8 was used, both regions 30 and 79 are
recognized as T-POLE-110. The lower part of the pole (region
79) is merged with some ground and background regions;
nevertheless it still resulted in a higher evidence than the upper
part (region 30). An effort is currently underway to implement a
region grouping technique based on evidences, proximity, size
and other criteria. The lower part of the tank was broken up into
six small regions because of the illumination and shape factors.
These regions were included in the hypothesis verification and
they produced significantly lower evidences. The uncertainty at
SITE-110 is U_110 = (1 + 0.3) * ((0.5)^3 / (0.875 * 0.62 * 0.70)) = 0.43,
where 0.875 is the average of 0.92 and 0.83.

Fig. 7(a) Image obtained at SITE-110.

Fig. 7(b) Initial segmentation results.

Fig. 7(c) Segmented image after region splitting, merging
and discarding small regions. Regions highlighted
by rectangles indicate the result of landmark
recognition by PREACTE.

IV. CONCLUSIONS

In this paper we have presented the concepts and initial results of
our ongoing research on the perception-reasoning-action and
expectation paradigm for guiding the ALV by recognizing landmarks
along the sides of the road. In the future, we will extend this work
to a more general and complex situation where the ALV may be
traveling through terrain and has to determine precisely where
it is on the map by using landmark recognition.


Table I. Landmark recognition results. Columns Region through Location
are region features; Match-Evidence and Landmark Hypothesis are the
recognition results.

Region | Size   | Color         | MBR                  | Texture   | Elongation  | Shape           | Location       | Match-Evidence | Landmark Hypothesis
30     | Small  | Black (25.3)  | (99, 103, 61, 111)   | Smooth    | High (50:4) | Long and Linear | (101.2, 82.9)  | 0.83           | T-POLE-110
79     | Small  | Black (27.3)  | (104, 105, 112, 132) | Irregular | High (20:1) | Long and Linear | (104.6, 121.9) | 0.92           | T-POLE-110
95     | Medium | White (224.5) | (445, 510, 140, 155) | Smooth    | Low (65:15) | Not convex      | (482, 148.3)   | 0.62           | G-TANK-110
7C     | Medium | Gray (159.7)  | (469, 510, 99, 127)  | Irregular | Low (41:28) | Linear          | (490.2, 115.3) | 0.71           | BLDG-110


References

[1] B. Bhanu, "Recognition of Occluded Objects," Proc. of the
8th International Joint Conference on Artificial Intelligence,
IJCAI-83, Karlsruhe, West Germany, August 8-12, 1983,
pp. 1136-1138.

[2] B. Bhanu, "Representation and Shape Matching of 3-D
Objects," IEEE Trans. on Pattern Analysis and Machine
Intelligence, Vol. PAMI-6, May 1984, pp. 340-351.

[3] B. Bhanu and O.D. Faugeras, "Shape Matching of
Two-dimensional Objects," IEEE Trans. on Pattern Analysis and
Machine Intelligence, Vol. PAMI-6, March 1984, pp. 137-156.

[4] B. Bhanu and T. Henderson, "CAGD Based 3-D Vision,"
IEEE International Conference on Robotics and Automation,
March 1985, pp. 411-417.

[5] B. Bhanu and W. Burger, "DRIVE: Dynamic Reasoning
Using Integrated Visual Evidences," Proc. DARPA Image
Understanding Workshop, University of Southern
California, Feb. 1987.

[6] T.O. Binford, "Survey of Model-Based Image Analysis,"
The International Journal of Robotics Research, Vol. 1,
Spring 1982, pp. 18-64.

[7] R.A. Brooks, "Symbolic Reasoning Among 3-D Models
and 2-D Images," Artificial Intelligence, Vol. 17, pp. 285-348,
1981.

[8] E. Charniak, "The Bayesian Basis of Common Sense
Reasoning in Medical Diagnosis," Proc. American
Association of Artificial Intelligence Conference 1983,
AAAI-83, pp. 70-73.

[9] E. Davis, Representing and Acquiring Geographic
Knowledge, Morgan Kaufmann Publishers, Inc., 1986.

[10] S.V. Hwang, "Evidence Accumulation for Spatial
Reasoning in Aerial Image Understanding," Ph.D. Thesis,
Dept. of Computer Science, University of Maryland,
College Park, Maryland, 1984.

[11] D. Lowe, Perceptual Organization and Visual Recognition,
Kluwer Publishing Co., 1985.

[12] A.R. Hanson and E.M. Riseman, "VISIONS: A Computer
System for Interpreting Scenes," in Computer Vision
Systems, A.R. Hanson and E.M. Riseman (Eds.), New
York, Academic Press, 1978, pp. 303-333.

[13] D.M. McKeown, Jr., W.A. Harvey, Jr., and J.
McDermott, "Rule-Based Interpretation of Aerial Imagery,"
IEEE Trans. on Pattern Analysis and Machine Intelligence,
Vol. PAMI-7, September 1985, pp. 570-585.

[14] M. Nagao and T. Matsuyama, A Structural Analysis of
Complex Aerial Photographs, Plenum Press, 1980.




The  CMU  Navigational  Architecture 


Anthony  Stentz 
Yoshimasa  Goto 

The  Robotics  Institute 
Carnegie-Mellon University
Pittsburgh,  PA  15213 


Abstract 


This  paper  describes  the  current  status  of  the  Autonomous  Land 
Vehicle  research  at  Carnegie-Mellon  University's  Robotics  Institute, 
focusing  primarily  on  the  system  architecture.  We  begin  with  a 
discussion  of  the  issues  concerning  outdoor  navigation,  then  describe 
the  various  perception,  planning,  and  control  components  of  our 
system  that  address  these  issues.  We  describe  the  CODGER 
software  system  for  integrating  these  components  into  a  single 
system,  synchronizing  the  data  flow  between  them  in  order  to 
maximize  parallelism.  Our  system  is  able  to  drive  a  robot  vehicle 
continuously  with  two  sensors,  a  color  camera  and  a  laser 
rangefinder,  on  a  network  of  sidewalks,  up  a  bicycle  slope,  and 
along a curved road through an area populated with trees. Finally,
we  discuss  the  results  of  our  experiments,  as  well  as  problems 
uncovered  in  the  process  and  our  plans  for  addressing  them. 


1. Introduction

The goal of the Autonomous Land Vehicle group at Carnegie-Mellon
University is to create an autonomous mobile robot system
capable of operating in outdoor environments. Because of the
complexity of real-world domains and the requirement for continuous
and real-time motion, such a robot system needs system architectural
support for multiple sensors and parallel processing. These
capabilities are not found in simpler robot systems. At CMU, we are
studying mobile robot system architecture and have developed a
navigation system that works at two test sites and on two experimental
vehicles [2] [3] [4] [8] [10] [11]. This paper describes the current status of
our system and some problems uncovered through real experiments.


1.1.  The  Test  Sites  and  Vehicles 

We have two test sites, the Carnegie-Mellon University campus and
an  adjoining  park,  Schenley  Park.  The  CMU  campus  test  site  has  a 
sidewalk  network  including  intersections,  stairs  and  bicycle  slopes 
(see  Figure  1).  The  Schenley  Park  test  site  has  curved  sidewalks  in 
an  area  well  populated  with  trees  (see  Figure  2). 

Figure  3  shows  our  two  experimental  vehicles,  the  NAVLAB  used  in 
the  Schenley  Park  test  site,  and  the  Terregator  used  in  the  CMU 
campus  test  site.  Both  of  them  are  equipped  with  a  color  TV  camera 
and  a  laser  rangefinder  made  by  ERIM.  The  NAVLAB  carries  four 
general  purpose  computers  (SUN-3s)  on  board.  The  Terregator  is 
linked  to  SUN-3s  in  the  laboratory  with  radio  communication.  All  of  the 
SUN-3s are interconnected with an EtherNet. Our navigation system
works  on  both  vehicles  in  each  test  site. 


This research was supported by the Strategic Computing Initiative of the Defense
Advanced Research Projects Agency, DoD, through ARPA Order 5351, and monitored
by the U.S. Army Engineer Topographic Laboratories under contract
DACA76-85-C-0003. Views and conclusions contained in this document are those of
the authors and should not be interpreted as representing official policies, either
expressed or implied, of the Defense Advanced Research Projects Agency or the
United States Government.


1.2.  Current  System  Capabilities 

Currently,  the  system  has  the  following  capabilities. 

•  Able to execute a prespecified user mission over a
mapped network of sidewalks, including turning at the
intersections and driving up the bicycle slope.

•  Able to recognize landmarks, stairs and intersections.

•  Able to drive on unmapped, curved, ill-defined roads
using assumptions about local road linearity.

•  Able to detect obstacles and stop until they move away.

•  Able to avoid obstacles.

•  Able to drive continuously at 200 mm/sec.


Figure 1: Map of the CMU Campus Test Site


Figure 2: Map of the Schenley Park Test Site

2.  Design  of  the  System  Architecture 

In this section we describe the goals of our outdoor navigation
system and the design principles, followed by an analysis of the
outdoor navigation task itself. We describe our system architecture as
it is shaped by these principles and analysis.


Figure 3: The NAVLAB and Terregator


2.1. Design Goals and Principles

The goals of our outdoor navigation system are:

•  map-driven  mission  execution:  The  system  drives  the 
vehicle  to  reach  a  given  goal  position. 

•  on-  and  off-road  navigation:  Navigation  environments 
include  not  only  roads  but  also  open  terrain. 

•  landmark  recognition:  Landmark  sightings  are  essential 
in  order  to  correct  for  drift  in  the  vehicle's  dead-reckoning 
system. 

•  obstacle  avoidance 

•  continuous motion in real time: Stop and go motion is
unacceptable for our purposes. Perception, planning,
and control should be carried out while the vehicle is
moving at a reasonable speed.

In  order  to  satisfy  these  goals,  we  have  adopted  the  following 
design  principles. 

•  sensor fusion: A single sensor is not enough to analyze
complex outdoor environments. Sensors include not only
a TV camera and a range sensor but also an inertial
navigation sensor, a wheel rotation counter, etc.

•  parallel execution: In order to process data from a
number of sensors, make global and local plans, and
drive the vehicle in real-time, parallelism is essential.

•  flexibility and extensibility: This principle is essential
because the whole system is quite large, requiring the
integration of a wide range of modules.


2.2.  Outdoor  Navigation  Tasks 

Outdoor  navigation  includes  several  different  navigation  modes. 
Figure  4  illustrates  several  examples.  On-road  vs.  off-road  is  just  one 
example.  Even  in  on-road  navigation,  turning  at  the  intersection 
requires  more  sophisticated  driving  skill  than  following  the  road.  In 
road  following,  the  assumption  that  the  ground  is  flat  makes 
perception  easier,  but  driving  through  the  forest  does  not  satisfy  this 
assumption  and  requires  more  complex  perception  processing. 

According  to  this  analysis  we  decompose  outdoor  navigation  into 
two  navigation  levels:  global  and  local.  At  the  global  level,  the  system 
tasks  are  to  select  the  best  navigation  route  to  reach  the  destination 
given by a user mission, and to divide the whole route into a sequence of
route  segments,  each  corresponding  to  a  uniform  driving  mode.  The 
current  system  supports  the  following  navigation  modes:  following  the 
road,  turning  at  the  intersection,  driving  up  the  slope. 
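The division of a route into segments with a uniform driving mode can be sketched as below. The route, the place names, and the function `split_into_segments` are hypothetical; only the three mode names correspond to the navigation modes listed above.

```python
from itertools import groupby


def split_into_segments(route):
    """Group consecutive route steps that share a driving mode.

    route: list of (place, mode) pairs in travel order.
    """
    segments = []
    for mode, steps in groupby(route, key=lambda step: step[1]):
        segments.append({"mode": mode, "places": [place for place, _ in steps]})
    return segments


route = [
    ("sidewalk-A", "follow-road"),
    ("sidewalk-B", "follow-road"),
    ("intersection-1", "turn-at-intersection"),
    ("slope-1", "drive-up-slope"),
    ("sidewalk-C", "follow-road"),
]
segments = split_into_segments(route)
for segment in segments:
    print(segment["mode"], segment["places"])
```

Grouping only consecutive steps keeps the travel order intact, so the same mode can appear in more than one segment.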

Figure 4: Outdoor Navigation

Local navigation involves driving within a single route segment.
The navigation mode is uniform and the system drives the vehicle
along the route segment continuously, perceiving objects, planning
paths, and controlling the vehicle. The important point is that
these tasks (perception, planning, and control) form a cycle and can
be executed concurrently.
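The idea that perception, planning, and control can overlap in time can be sketched with three threads joined by queues. This is a generic illustration, not the CMU implementation: the stage functions are placeholders that only build strings, and `None` is used as a hypothetical end-of-input marker.

```python
import queue
import threading


def stage(work, inbox, outbox):
    """Apply `work` to every item from inbox until the None marker arrives."""
    while True:
        item = inbox.get()
        if item is None:
            if outbox is not None:
                outbox.put(None)  # pass the end marker downstream
            break
        result = work(item)
        if outbox is not None:
            outbox.put(result)


def run_cycle(driving_units):
    """Push units through perceive -> plan -> control, one thread per task."""
    to_perceive, to_plan, to_control = queue.Queue(), queue.Queue(), queue.Queue()
    executed = []
    workers = [
        threading.Thread(target=stage, args=(lambda u: f"objects({u})", to_perceive, to_plan)),
        threading.Thread(target=stage, args=(lambda o: f"path({o})", to_plan, to_control)),
        threading.Thread(target=stage, args=(executed.append, to_control, None)),
    ]
    for worker in workers:
        worker.start()
    for unit in driving_units:
        to_perceive.put(unit)
    to_perceive.put(None)
    for worker in workers:
        worker.join()
    return executed


print(run_cycle(["unit-1", "unit-2", "unit-3"]))
```

While the control thread is handling one unit, the planning and perception threads can already be working on the following ones, which is the overlap the paragraph describes.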


2.3.  System  Architecture 

Figure 5 is a block diagram of our system architecture. The
architecture  consists  of  several  modules  and  a  communications 
database  which  links  the  modules  together. 


2.3.1. Module Structure

In  order  to  support  the  tasks  described  in  the  previous  section,  we 
first  decomposed  the  whole  system  into  the  following  modules: 

•  CAPTAIN executes user mission commands and sends
the destination and the constraints of each mission step
to the MAP NAVIGATOR one step at a time, and receives the
result of each mission step.

•  MAP NAVIGATOR selects the best route by searching
the Map Database, decomposes it into a sequence of
route segments, generates a route segment description
which includes objects from the Map visible from the
route segment, and sends it to the PILOT.

•  PILOT coordinates the activities of PERCEPTION and
the HELM to perform local navigation continuously
within a single route segment.

•  PERCEPTION uses sensors to find objects predicted to
lie within the vehicle's field of view. It estimates the
vehicle's position if possible.

•  HELM gets the local path plan generated by the PILOT
and drives the vehicle.

Figure 5: System Architecture

The  PILOT  is  decomposed  into  several  submodules  which  run 
concurrently  (see  Figure  6). 

•  DRIVING MONITOR decomposes the route segment into
small pieces called driving units. A driving unit is the
basic unit for perception, planning, and control processing
at the local navigation level. For example, PERCEPTION
must be able to process a whole driving unit with a single
image. The DRIVING MONITOR creates a driving unit
description, which describes objects in the driving unit,
and sends it to the following submodules.

•  DRIVING  UNIT  FINDER  functions  as  an  interface  to 
PERCEPTION,  sending  the  driving  unit  description  to  it 
and  getting  the  result  from  it. 


•  POSITION ESTIMATOR estimates the vehicle position
using both the results of PERCEPTION and dead-reckoning.

•  DRIVING  UNIT  NAVIGATOR  determines  the  admissible 
passage  in  which  to  drive  the  vehicle. 

•  LOCAL  PATH  PLANNER  generates  the  path  plan  within 
the  driving  unit,  avoids  obstacles  and  keeps  the  vehicle  in 
the  admissible  passage.  The  path  plan  is  sent  to  the 
HELM. 


2.3.2.  CODGER 

The  second  problem  in  the  system  architecture  design  is 
connecting  the  modules.  Based  on  our  design  principles,  we  have 
created  a  software  system  called  CODGER  (Communications 
Database  with  GEometric  Reasoning)  which  supports  parallel 
asynchronous  execution  and  communication  between  the  modules. 
We  describe  CODGER  in  detail  in  the  next  section. 


3.  Parallelism 


3.1.  The  CODGER  System  for  Parallel  Processing 

In  order  to  navigate  in  real-time,  we  have  employed  parallelism  in 
our  perception,  planning,  and  control  subsystems.  Our  computing 
resources  consist  of  several  SUN-3  microcomputers,  VAX 
minicomputers,  and  a  high-speed,  parallel  processor  known  as  the 
WARP  interconnected  with  an  EtherNet.  We  have  designed  and 
implemented  a  software  system  called  CODGER  (Communications 
Database  with  GEometric  Reasoning)  [9]  to  effectively  utilize  this 
parallelism. 

The CODGER system consists of a central database (Local Map), a
process that manages this database (Local Map Builder or LMB), and
a library of functions for accessing the data (LMB interface) (see
Figure  7).  The  various  perceptual,  planning,  and  control  modules  in 
the  system  are  compiled  with  the  LMB  interface  and  invoke  functions 
to  store  and  retrieve  data  from  the  central  database.  The  CODGER 
system  can  be  run  on  any  mix  of  SUN-3s  and  VAXes  and  handles 
data  type  conversions  automatically.  This  system  permits  highly 
modular  development  requiring  recompilation  only  for  modules  directly 
affected  by  a  change. 


Figure 6: Submodule Structure of the PILOT


3.1.1.  Data  Representation 

Data  in  the  Local  Map  is  represented  in  tokens  consisting  of  lists  of 
attribute-value  pairs.  Tokens  can  be  used  to  represent  any 
information  including  physical  objects,  hypotheses,  plans,  commands, 
and  reports.  The  token  types  are  defined  in  a  template  file  which  is 
read  by  the  LMB  at  system  startup  time.  Attribute  types  may  be  the 
usual  scalars  (e.g.,  floats,  integers),  sets  of  scalars,  or  geometric 
locations. Geometric locations consist of a two-dimensional,
polygonal shape and a reference coordinate frame. The CODGER
system provides mechanisms for defining coordinate frames and for
automatically converting geometric data from one frame to another,
thereby allowing modules to retrieve data from the database and
represent it in a form meaningful to them. Geometric data is the
only data interpreted by the CODGER system; the interpretation of all
other data types is delegated to the modules that use them.
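A token with a geometric location, and the conversion of that location between coordinate frames, can be sketched as follows. This is not the CODGER interface: the dictionary layout, the frame description (origin plus heading relative to a world frame), and all function names are assumptions made for illustration.

```python
import math


def make_frame(x, y, heading):
    """A 2-D frame given by its origin and heading (radians) in the world frame."""
    return {"x": x, "y": y, "heading": heading}


def to_world(point, frame):
    """Express a point given in `frame` coordinates in world coordinates."""
    c, s = math.cos(frame["heading"]), math.sin(frame["heading"])
    return (frame["x"] + c * point[0] - s * point[1],
            frame["y"] + s * point[0] + c * point[1])


def from_world(point, frame):
    """Express a world point in `frame` coordinates."""
    dx, dy = point[0] - frame["x"], point[1] - frame["y"]
    c, s = math.cos(frame["heading"]), math.sin(frame["heading"])
    return (c * dx + s * dy, -s * dx + c * dy)


def convert_location(location, frames, target):
    """Re-express a polygonal location in the target frame."""
    source = frames[location["frame"]]
    shape = [from_world(to_world(p, source), frames[target]) for p in location["shape"]]
    return {"shape": shape, "frame": target}


frames = {
    "world": make_frame(0.0, 0.0, 0.0),
    "vehicle": make_frame(10.0, 5.0, math.pi / 2),  # vehicle faces world +y
}
token = {  # attribute-value pairs; only "location" is geometric
    "type": "obstacle",
    "height": 0.4,
    "location": {"shape": [(11.0, 5.0), (11.0, 6.0), (12.0, 6.0)], "frame": "world"},
}
in_vehicle = convert_location(token["location"], frames, "vehicle")
print([(round(x, 3), round(y, 3)) for x, y in in_vehicle["shape"]])
```

Routing every conversion through the world frame keeps the sketch short; a module asking for the obstacle in vehicle coordinates sees the first vertex one unit to its right, since the vehicle faces the world +y direction.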


3.1.2.  Synchronization 

The  LMB  interface  provides  functions  for  storing  and  retrieving  data 
from  the  central  database.  Tokens  can  be  retrieved  using 
specifications.  Specifications  are  simply  boolean  expressions 
evaluated  across  token  attribute  values.  A  specification  may  include 
computations  such  as  mathematical  expressions,  boolean  relations, 
and  comparisons  between  attribute  values.  Geometric  indexing  is  of 
particular  importance  for  a  mobile  robot  system.  For  example,  the 
planner  needs  to  search  a  database  of  map  objects  to  locate  suitable 
landmarks  or  to  find  the  shortest  path  to  the  goal.  The  CODGER 
system  provides  a  host  of  functions  including  those  for  computing  the 
distance  and  intersection  of  locations.  These  functions  can  be 
embedded  in  specifications  and  matched  to  the  database. 
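Retrieval by specification can be sketched as a boolean predicate evaluated over token attributes, with a geometric function embedded in it. The token layout and the names `retrieve` and `distance` are hypothetical, and locations are simplified to points rather than the polygonal shapes the system stores.

```python
import math


def distance(a, b):
    """Euclidean distance between two point locations."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def retrieve(tokens, spec):
    """Return every token for which the specification evaluates to True."""
    return [token for token in tokens if spec(token)]


tokens = [
    {"type": "landmark", "name": "tree-1", "position": (4.0, 3.0)},
    {"type": "landmark", "name": "stairs-1", "position": (30.0, 40.0)},
    {"type": "obstacle", "name": "box-1", "position": (1.0, 1.0)},
]
vehicle = (0.0, 0.0)

# Specification: landmarks within 10 units of the vehicle.
nearby_landmarks = retrieve(
    tokens,
    lambda t: t["type"] == "landmark" and distance(t["position"], vehicle) < 10.0,
)
print([t["name"] for t in nearby_landmarks])
```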

The  CODGER  system  has  a  set  of  primitives  to  ensure  that  data 
transfer  between  system  modules  is  synchronized  and  runs  smoothly. 
The synchronization is implemented in the data retrieval mechanism.
Specifications are sent to the LMB as either one-shot or standing
requests. For one-shot specs, the calling module blocks while the
LMB  matches  the  spec  to  the  tokens.  Tokens  that  match  are  retrieved 
and  the  module  resumes  execution.  If  no  tokens  match,  either  the 
module  stays  blocked  until  a  matching  token  appears  in  the  database 
or  an  error  is  returned  and  the  module  resumes  execution,  depending 
on  an  option  specified  in  the  request.  For  example,  the  PATH 
PLANNER  may  use  a  one-shot  to  find  obstacles  stored  in  the 
database  before  it  can  plan  a  path.  In  contrast,  the  HELM,  which 
controls  the  vehicle,  uses  a  standing  spec  to  retrieve  tokens  supplying 
steering  commands  whenever  they  appear. 
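
The difference between one-shot and standing requests can be sketched with a toy blackboard. This is an illustration under stated assumptions, not the LMB implementation: the `Blackboard` class and its method names are hypothetical, only the "return an error" option of a one-shot request is modeled (the blocking option would need threads), and tokens are plain dictionaries.

```python
class Blackboard:
    """Toy stand-in for the central database (not the real LMB)."""

    def __init__(self):
        self.tokens = []
        self.standing = []  # list of (spec, callback) pairs

    def one_shot(self, spec):
        """Return the tokens matching now; raise if none match."""
        matches = [t for t in self.tokens if spec(t)]
        if not matches:
            raise LookupError("no token matches the specification")
        return matches

    def standing_request(self, spec, callback):
        """Register a spec; callback is called for each future matching token."""
        self.standing.append((spec, callback))

    def store(self, token):
        """Add a token and notify every standing request it satisfies."""
        self.tokens.append(token)
        for spec, callback in self.standing:
            if spec(token):
                callback(token)


board = Blackboard()
received = []
# HELM-like consumer: react to each steering command as it appears.
board.standing_request(lambda t: t["type"] == "steering", received.append)
board.store({"type": "obstacle", "polygon": [(0, 0), (1, 0), (1, 1)]})
board.store({"type": "steering", "curvature": 0.1})
# PATH PLANNER-like consumer: fetch the obstacles known right now.
obstacles = board.one_shot(lambda t: t["type"] == "obstacle")
```

The standing request keeps delivering tokens without the consumer asking again, while the one-shot request answers a single question about the current database contents.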


3.2. Parallel Asynchronous Execution of Modules

Thus  far  we  have  run  our  scenarios  with  four  SUN-3s 
interconnected by Ethernet. The CAPTAIN, MAP NAVIGATOR,
PILOT,  and  HELM  are  separate  modules  in  the  system,  and 
PERCEPTION  is  two  modules  (range  and  camera  image  processing). 
All  of  the  modules  run  in  parallel;  they  synchronize  themselves 
through  the  LMB  database. 


3.2.1.  Global  and  Local  Navigation 

A good example of parallelism in the system is the interaction
between  the  CAPTAIN,  MAP  NAVIGATOR,  and  PILOT.  The 
CAPTAIN  and  MAP  NAVIGATOR  search  the  map  database  to  plan  a 
global  path  for  the  vehicle  in  accordance  with  the  mission 
specification.  The  PILOT  coordinates  PERCEPTION,  PATH 
PLANNING,  and  control  through  the  HELM  to  navigate  locally.  The 
global  and  local  navigation  operations  run  in  parallel.  The  MAP 
NAVIGATOR  monitors  the  progress  of  the  PILOT  to  ensure  that  the 
PILOT'S  transition  from  one  route  segment  to  the  next  occurs 
smoothly. 


3.2.2.  Driving  Pipeline 

Another good example of parallelism is within the PILOT itself. As
described earlier, the PILOT monitors local navigation. For each
driving unit, the PILOT performs four operations in the following order:
predict it, recognize it with the camera and scan it for obstacles with
the rangefinder, establish driving constraints and plan a path through
it, and oversee the vehicle's execution of it. In the PILOT, these four
operations are separate modules linked together in a pipeline. While
in steady state, the PILOT is predicting a driving unit 12 to 16 meters
in front of the vehicle, recognizing a driving unit and scanning it for
obstacles (in parallel) 8 to 12 meters in front, planning a path 4 to 8
meters in front, and driving to a point 4 meters in front. The stages of
the pipeline synchronize themselves through the CODGER database.

The  processing  times  for  each  stage  vary  as  a  function  of  the 
navigation  task.  In  navigation  on  uncluttered  roads,  the  vision 
subsystem  requires  about  10  seconds  of  real-time  per  image,  the 
range  subsystem  requires  about  6  seconds,  and  the  local  path 
planner  requires  less  than  a  second.  In  this  case,  the  stage  time  of 
the  pipeline  is  that  of  the  vision  subsystem:  10  seconds.  In  cluttered 
environments,  the  local  path  planner  may  require  10  to  20  seconds  or 
more,  thereby  becoming  the  bottleneck.  In  either  case,  the  vehicle  is 
not permitted to drive on to a driving unit until it has propagated
through all stages of the pipeline (i.e., all operations have been
performed on it). For example, when driving around the corner of a
building,  the  vision  stage  must  wait  until  the  vehicle  reaches  the 
corner  in  order  to  see  the  next  driving  unit.  Once  the  vehicle  reaches 
the  corner,  it  must  stop  while  waiting  for  the  vision,  scanning,  and 
planning  stages  to  process  the  driving  unit  before  driving  again. 
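
The stage-time argument above can be checked with a few lines of arithmetic. This is a sketch under assumptions: the planner time of 1.0 second stands in for "less than a second", 20.0 seconds stands in for the cluttered case, and the 4-meter driving unit used for the speed bound is an illustrative value, not a figure the text gives for this calculation.

```python
def bottleneck(stage_seconds):
    """Return (slowest stage, its time); the pipeline advances at that rate."""
    name = max(stage_seconds, key=stage_seconds.get)
    return name, stage_seconds[name]


def average_speed(unit_length_m, stage_time_s):
    """Upper bound on average speed if one driving unit clears per stage time."""
    return unit_length_m / stage_time_s


# Per-stage processing times in seconds (planner values are assumed bounds).
uncluttered = {"vision": 10.0, "range": 6.0, "planner": 1.0}
cluttered = {"vision": 10.0, "range": 6.0, "planner": 20.0}

print(bottleneck(uncluttered))
print(bottleneck(cluttered))
print(average_speed(4.0, bottleneck(uncluttered)[1]))
```

Because the stages run concurrently on different driving units, the slowest stage alone sets the rate at which driving units become available to the vehicle.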


4.  Sensor  Fusion 


4.1.  Types  of  Sensor  Fusion 

The  NAVLAB  and  Terregator  vehicles  are  equipped  with  a  host  of 
sensors  including  color  cameras,  a  laser  rangefinder,  and  motion 
sensors  such  as  a  gyro  and  shaft-encoder  counter.  In  order  to  obtain 
a  single,  consistent  interpretation  of  the  vehicle’s  environment,  the 
results  of  these  sensors  must  be  fused.  We  have  identified  three 
types  of  sensor  fusion  [8]: 

•  Competitive:  Sensors  provide  data  that  either  agrees  or 
conflicts.  This  case  arises  when  sensors  provide  data  of 
the  same  modality.  In  the  CMU  systems,  the  task  of 
determining  the  vehicle's  position  best  characterizes  this 
type of fusion. Readings from the vehicle's dead-reckoning
system as well as landmark sightings provide
estimates of the vehicle's position.

•  Complementary:  Sensors  provide  data  of  different 
modalities.  The  task  of  recognizing  three-dimensional 
objects  illustrates  this  kind  of  fusion.  In  the  CMU 
systems,  a  set  of  stairs  is  recognized  using  a  color 
camera  and  laser  rangefinder.  The  color  camera 
provides image information (e.g., color and texture) while
the  laser  rangefinder  provides  three-dimensional 
information. 

•  Independent:  A  single  sensor  is  used  for  each  task.  An 
example  of  a  task  requiring  a  single  sensor  is  distant 
landmark  recognition.  In  this  case,  only  the  camera  is 
used  for  landmarks  beyond  the  range  of  the  laser 
rangefinder. 


4.2.  Examples  of  Sensor  Fusion  Tasks 


4.2.1.  Vehicle  Position  Estimation 

In  our  road  following  scenarios,  vehicle  position  estimation  has 
been  the  most  important  sensor  fusion  task.  By  vehicle  position,  we 
mean the position and orientation of the vehicle in the ground plane (3
degrees of freedom) relative to the world coordinate frame. In the
current  system,  there  are  two  sources  of  position  information.  First, 
dead-reckoning  provides  vehicle-based  position  information.  The 
CODGER  system  maintains  a  history  of  the  steering  commands 
issued  to  the  vehicle,  effectively  recording  the  trajectory  of  the  vehicle 
from  its  starting  point. 

Second,  landmark  sightings  directly  pinpoint  the  position  of  the 
vehicle  with  respect  to  the  world  at  a  point  in  time.  In  the  campus  test 
site,  the  system  has  access  to  a  complete  topographical  map  of  the 
sidewalks and intersections on which it drives. The system uses a
color camera to sight the intersections and sidewalks and uses these
sightings  to  correct  the  estimate  of  the  vehicle's  position.  The 
intersections  are  of  rank  three,  meaning  that  the  position  and 
orientation  of  the  vehicle  with  respect  to  the  intersection  can  be 
determined  fully  (to  three  degrees  of  freedom)  from  the  sighting.  Our 
tests  have  shown  that  such  landmark  sightings  are  far  more  accurate 
but less reliable than the current dead-reckoning system; that is,
landmark sightings provide more accurate vehicle position estimates,
but the sightings occasionally fail. If the vehicle position
estimates from the sighting and dead-reckoning disagree drastically,
the conflict is settled in favor of the dead-reckoning system; otherwise,
the result from the landmark sighting is used. In the latter case, the
CODGER  system  adjusts  its  record  of  the  vehicle's  trajectory  so  that  it 
agrees  with  the  most  recent  landmark  sighting,  and  discards  all 
previous  sightings. 
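
The arbitration rule described above can be written as a small function. This is a sketch, not the CODGER code: the function name, the pose layout `(x, y, heading)`, and both thresholds that define a "drastic" disagreement are assumptions made for illustration; the text does not give numeric thresholds.

```python
import math


def fuse_position(dead_reckoning, sighting, max_offset_m=2.0, max_heading_rad=0.5):
    """Choose which pose (x, y, heading) to trust, following the rule in the text.

    The two thresholds are hypothetical; the source does not state them.
    """
    if sighting is None:  # the sighting failed outright
        return dead_reckoning
    dx = sighting[0] - dead_reckoning[0]
    dy = sighting[1] - dead_reckoning[1]
    # Wrap the heading difference into [-pi, pi).
    dtheta = (sighting[2] - dead_reckoning[2] + math.pi) % (2 * math.pi) - math.pi
    if math.hypot(dx, dy) > max_offset_m or abs(dtheta) > max_heading_rad:
        return dead_reckoning  # drastic disagreement: distrust the sighting
    return sighting  # otherwise the sighting is the more accurate estimate


print(fuse_position((0.0, 0.0, 0.0), (0.3, 0.1, 0.05)))
print(fuse_position((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)))
```

Dead-reckoning acts here as a plausibility check on the sighting rather than as a second measurement to be averaged in.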

The  CODGER  system  is  able  to  handle  landmark  sightings  of  rank 
less than three. The most common "landmark" in our scenarios is the
sidewalk on which the vehicle drives. Since a sidewalk sighting
provides only the orientation and perpendicular distance of the vehicle
with respect to the sidewalk, the correction is of rank two. Therefore,
the position of the vehicle is constrained to lie on a straight line. The
CODGER system projects the position of the vehicle from dead-reckoning
onto this line and uses the projected point as a full (rank
three) correction. Since most of the error in the vehicle's motion is
lateral drift from the road, this approximation works well.
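
The rank-two correction amounts to an orthogonal projection of the dead-reckoned position onto the line fixed by the sidewalk sighting. A minimal sketch follows; the function names and the representation of the line (a point on it plus a direction vector) are assumptions for illustration.

```python
def project_onto_line(point, line_point, line_dir):
    """Orthogonally project a 2-D point onto the line through line_point along line_dir."""
    px, py = point
    ax, ay = line_point
    dx, dy = line_dir
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    return (ax + t * dx, ay + t * dy)


def rank_two_correction(dr_pose, line_point, line_dir, sighted_heading):
    """Build a full (x, y, heading) correction from a sidewalk sighting.

    The position along the sidewalk comes from dead-reckoning; the lateral
    position and the heading come from the sighting.
    """
    x, y = project_onto_line(dr_pose[:2], line_point, line_dir)
    return (x, y, sighted_heading)


# Sighting constrains the vehicle to the line y = 1 with heading 0.
print(rank_two_correction((3.0, 2.0, 0.2), (0.0, 1.0), (1.0, 0.0), 0.0))
```

The projection keeps the along-track coordinate from dead-reckoning and replaces only the lateral offset, which is where most of the error is said to accumulate.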


4.2.2.  Pilot  Control 

Complementary fusion is grounded in the Pilot's control functions.
The Pilot ensures that the vehicle travels only where it is permitted
and where it can. For example, the color camera is used to segment
road from nonroad surfaces. The laser rangefinder scans the area in
front of the vehicle for obstacles or unnavigable (i.e., rough or steep)
terrain. The road surface is fused with the free space and is passed
to the local path planner. Since the two sensor operations do not
necessarily occur at the same time, the vehicle's dead-reckoning
system also comes into play.


4.2.3.  Colored  Range  Image 

Another  example  of  complementary  fusion  of  camera  and  range 
data is the colored range image. A colored range image is created by
"painting" a color image onto the depth map of a range image. The
resultant image is used in our systems to recognize complicated
three-dimensional objects such as a set of stairs. In order to avoid the
relatively large error in the vehicle's dead-reckoning system, the
vehicle remains motionless while digitizing a corresponding pair of
camera and range images [2].


4.3.  Problems  and  Future  Work 

We have plans for improving our sensor fusion mechanisms.
Currently, the CODGER system handles competing sensor data by
retaining the most recent measurement and discarding all others.
This is undesirable for the following reasons. First, a single bad
measurement (e.g., landmark sighting) can easily throw the vehicle off
track. Second, measurements can reinforce each other. By
discarding old measurements, useful information is lost. A weighting
scheme is needed for combining competing sensor data. In many
cases, it is useful to model error in sensor data as Gaussian noise.
For example, error in dead-reckoning may arise from random error in
the wheel velocities. Likewise, quantization error in range and camera
images can be modeled as Gaussian noise. A number of schemes
exist for fusing such data, ranging from simple Kalman filtering
techniques to full-blown Bayesian observation networks [1] [7].
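
As an illustration of what such a weighting scheme could look like, the sketch below fuses two scalar estimates with Gaussian error by inverse-variance weighting, which is the scalar form of a Kalman update. It is not a mechanism the text attributes to CODGER; the function name and the example variances are assumptions.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same scalar quantity.

    Each estimate is weighted by the other's variance, so the more certain
    measurement dominates; the fused variance is smaller than either input.
    """
    weight_a = var_b / (var_a + var_b)
    mean = weight_a * mean_a + (1.0 - weight_a) * mean_b
    var = (var_a * var_b) / (var_a + var_b)
    return mean, var


# Dead-reckoning says 10.0 m (variance 4.0); a sighting says 12.0 m (variance 1.0).
mean, var = fuse_gaussian(10.0, 4.0, 12.0, 1.0)
print(mean, var)
```

Unlike the keep-the-latest policy, this rule lets measurements reinforce each other, and a single outlier with a large stated variance shifts the estimate only slightly.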


5.  Local  Control 

In this section we discuss some of the control problems in local
navigation.


5.1. Adaptive Driving Units and Sensor View Frames

Management of driving units and sensor view frames is essential in
local control. As described in section 2, the driving unit is a minimum
control unit, a unit to perceive objects, generate a path plan, and drive
the vehicle. The PERCEPTION module digitizes an image in each
driving unit, and the vehicle's position is estimated and its trajectory is
planned once in each driving unit. Therefore, an appropriate driving
unit size is essential for stable control. For example, the sensor view
frame cannot cover a very large driving unit. Conversely, small driving
units place rigid constraints on the LOCAL PATH PLANNER, because
of the short distance between the starting point and the goal point.
The aiming of the sensor view frame determines the point at which to
digitize an image and to update the vehicle position and path plan.

In  the  current  system,  the  sensor  view  frame  is  always  fixed  with 
respect to the vehicle. The size of the driving unit is fixed for driving
on roads (4-6 meters in length), and is changed for turning at
intersections so that the entire intersection can be seen in a single
image and to increase driving stability (see Figure 8). This method
works well in almost all situations at the current test site.


For  intersections  requiring  sharp  turns  (about  135  degrees),  the 
current  method  does  not  suffice.  Because  there  is  only  one  driving 
unit  at  the  intersection,  the  system  digitizes  an  image,  estimates  the 
vehicle's  position,  and  generates  a  path  plan  only  once  for  a  large 
turn.  Furthermore,  since  the  camera's  field  of  view  is  fixed  straight 
ahead,  the  system  cannot  see  the  driving  unit  after  the  intersection 
until the vehicle has turned through the intersection. Although the
path actually generated is acceptable, it is potentially unstable.

This experimental result indicates that the system should scan for
an admissible passage, and update the vehicle position estimate and
local path plan, more frequently when the vehicle changes course
more quickly. We plan to improve our method for managing driving
units. Our new idea is:

•  Length  of  the  driving  unit:  The  length  of  the  driving  unit 
is  bounded  at  the  low  end  by  the  LOCAL  PATH 
PLANNER'S  requirements  for  generating  a  reasonable 
path  plan,  and  at  the  high  end  by  the  view  frame  required 
by  PERCEPTION  for  recognizing  a  given  object. 

•  Driving unit interval: The driving unit interval is the
distance  between  the  centers  of  adjacent  driving  units. 
Adjacent  driving  units  can  be  overlapped,  that  is,  they  can 
be  placed  such  that  their  interval  is  shorter  than  their 
length.  Figure  9  illustrates  this  situation. 

•  Adjusting size and interval of driving unit: If the
passage is simple, the length and interval of the driving
unit are long. If the passage is complex, for example, in
the case of highly curved roads or intersections, or in the
presence of obstacles, the length and interval of the driving
unit are shorter. If the required driving unit interval
must be shorter than the length of the driving unit, the driving
units are overlapped. As a result, the vehicle's position is
estimated and a local path is planned more frequently, so
that the vehicle drives stably (see Figure 9).



Figure  9:  Adaptive  Driving  Units 


•  Adjusting sensor view frame: The sensor view frame
with respect to the vehicle, that is, the distance and direction
from the vehicle to the driving unit, is adjusted using the
pan and tilt mechanism of the sensor. If the processing time
of PERCEPTION and the PILOT is constant, a longer distance
to the next driving unit allows a higher vehicle speed.
However, a longer distance also produces less accuracy in
perception and vehicle position estimation. Therefore, the
distance is determined by the required accuracy, which depends
on the complexity of the passage. Using the pan and tilt
mechanism, PERCEPTION can digitize an image at the
best distance from the driving unit, since the sensor's
view frame is less rigidly tied to the orientation and
position of the vehicle.
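
The relationship between driving unit length and interval in the list above can be sketched as follows. This is an illustration only: it treats the path as a one-dimensional arc length, gives all units the same length, and uses hypothetical function names and example values.

```python
def place_driving_units(path_length, unit_length, interval):
    """Return (start, end) spans of equal-length driving units along a path, in metres.

    With equal lengths, the distance between unit centres equals the
    distance between unit starts, so `interval` is used as the start spacing.
    """
    if path_length <= 0 or unit_length <= 0 or interval <= 0:
        raise ValueError("path_length, unit_length and interval must be positive")
    units = []
    start = 0.0
    while True:
        end = min(start + unit_length, path_length)
        units.append((start, end))
        if end >= path_length:
            break
        start += interval
    return units


def overlapped(unit_length, interval):
    """Adjacent units overlap when the interval is shorter than the length."""
    return interval < unit_length


print(place_driving_units(12.0, 6.0, 6.0))   # simple passage: long, abutting units
print(place_driving_units(10.0, 4.0, 2.0))   # complex passage: short, overlapped units
```

In the overlapped case each stretch of the path falls inside two driving units, so the position estimate and path plan are refreshed roughly twice as often as with abutting units of the same length.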


5.2.  Vehicle  Speed 

It is important for an autonomous mobile robot to adjust
the vehicle's speed automatically so that the vehicle drives safely at
the highest possible speed. The current system slows the vehicle
down in turns to reduce driving error.

The delay in processing in the LOCAL PATH PLANNER and in
communication between the HELM and the actual vehicle mechanism
gives rise to error in vehicle position estimation. For example,
because of continuous motion and non-zero processing time, the
vehicle position used by the LOCAL PATH PLANNER as a starting
point differs slightly from the vehicle position when the vehicle starts
executing the plan. Because smaller turning radii give rise to
larger errors in the vehicle's heading, which are more serious than
displacement errors, the HELM slows the vehicle for smaller turning
radii. This method is useful for making the vehicle motion stable.

We plan to add to the system the capability of adjusting the
vehicle speed to the highest possible value automatically. Our idea is
the following:

•  Schedule token: The modules and the submodules
working at the local navigation level store their predicted
processing times in a schedule token in each cycle.
PERCEPTION is the most time consuming module, and
its processing time varies drastically from task to task.

•  Adjusting vehicle speed: Using the path plan and the
predicted processing time stored in the schedule token,
the HELM calculates and adjusts vehicle speed so that
the speed is maximum and the modules can finish
processing the driving unit before the vehicle reaches the
end of the current planned trajectory.
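
The speed rule proposed above can be sketched as a one-line bound. The following is an assumption-laden illustration, not the planned HELM algorithm: the function name, the dictionary standing in for the schedule token, the mechanical speed limit, and the conservative choice to sum the predicted times (as if the modules ran one after another on the next driving unit) are all hypothetical.

```python
def max_safe_speed(remaining_path_m, predicted_times_s, speed_limit_mps):
    """Highest speed at which processing finishes before the trajectory ends.

    predicted_times_s plays the role of the schedule token: predicted
    processing time per module for the next driving unit, in seconds.
    """
    busy = sum(predicted_times_s.values())
    if busy <= 0:
        return speed_limit_mps
    return min(speed_limit_mps, remaining_path_m / busy)


schedule_token = {"perception": 8.0, "planner": 2.0}
print(max_safe_speed(8.0, schedule_token, speed_limit_mps=2.0))
print(max_safe_speed(8.0, schedule_token, speed_limit_mps=0.5))
```

When PERCEPTION predicts a long processing time, the bound drops and the vehicle slows instead of stopping at the end of its planned trajectory.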


5.3.  Local  Path  Planning  and  Obstacle  Avoidance 

Local  path  planning  is  the  task  of  finding  a  trajectory  for  the  vehicle 
through  admissible  space  to  a  goal  point.  In  our  system,  the  vehicle  is 
constrained  to  move  in  the  ground  plane  around  obstacles 
(represented  by  polygons)  while  remaining  within  the  driving  unit  (also 
a  polygon).  We  have  employed  a  configuration  space  approach 
[5][6]. This algorithm, however, assumes that the vehicle is
omnidirectional.  Since  our  vehicles  are  not,  we  smooth  the  resultant 
path  to  ensure  that  the  vehicle  can  execute  it.  The  smoothed  path  is 
not  guaranteed  to  miss  obstacles.  We  plan  to  overcome  this  problem 
by  developing  a  path  planner  that  reasons  about  constraints  on  the 
vehicle’s  motion. 


6.  Navigation  Map 

Some information about the vehicle's environment must be supplied to
the system a priori, even if it is incomplete, and even if it is nothing
more than a data format for storing explored terrain. The user
mission, for example, "turn at the second cross intersection and stop
in front of the three oak trees", does not make sense to the system
without a description of the environment. The Navigation Map is a data
base to store the environment description needed for navigation.


6.1. Map Structure

The navigation map is a set of descriptions of physical objects in
the navigation world. It is composed of two parts, the geographical map
and the object data base. The geographical map stores object
locations with their contour polylines. The object data base stores
object geometrical shapes and other attributes, for example, the
navigation cost of objects. Although in the current system all objects
are described in both the geographical map and the object data
base, in general either description can be omitted. For example, the
location of stairs A may be known while its shape is unknown.

The  shape  description  is  composed  of  two  layers.  The  first  layer 
stores shape attributes, for example, the width of the road, the length
of the road, the height of the stairs, the number of steps, etc. The
second  layer  stores  actual  geometrical  shapes  represented  by  the 
surface  description.  It  is  easy  to  describe  incomplete  shape 
information  with  only  the  first  layer. 
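
One way to picture this two-part, two-layer structure is with optional fields, so that a missing description is simply absent. The class and field names below are hypothetical and chosen only to mirror the terms in the text; they are not the CODGER token layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ShapeDescription:
    """Shape entry in the object data base."""
    attributes: dict = field(default_factory=dict)  # first layer, e.g. width, number of steps
    surfaces: Optional[list] = None                  # second layer; None when not known


@dataclass
class MapObject:
    """A physical object in the navigation map."""
    name: str
    contour: Optional[list] = None             # geographical map: contour polyline
    shape: Optional[ShapeDescription] = None   # object data base: shape description
    cost: Optional[float] = None               # navigation cost, if assigned


# Location known, shape unknown.
stairs_a = MapObject("stairs A", contour=[(0.0, 0.0), (2.0, 0.0), (2.0, 3.0), (0.0, 3.0)])
# Incomplete shape information: first layer only.
road = MapObject("road 1", shape=ShapeDescription({"width_m": 3.0, "length_m": 50.0}))
print(stairs_a.shape, road.shape.surfaces)
```

Keeping each part optional lets a partially explored object enter the map before its full surface description is available.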


6.2. Data Retrieval

The map data is stored in the CODGER data base as a set of
tokens forming a tree structure. In order to retrieve map data, parent
tokens have indexes to child tokens. Because the current CODGER
system provides modules with a token retrieval mechanism that can
pick up only one token at a time, retrieving large portions of the map is
cumbersome. We plan to extend CODGER so that it can match and
retrieve  larger  structures,  possibly  combined  with  an  inheritance 
mechanism. 
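
The one-token-at-a-time retrieval described above can be sketched as a tree walk that follows child indexes. This is an illustration under assumptions: `fetch_token` stands in for whatever single-token retrieval call a module would use, and the dictionary-based tokens with a `children` list are a hypothetical layout.

```python
def fetch_subtree(fetch_token, root_id):
    """Collect a token and all its descendants, one token fetch at a time."""
    collected = []
    pending = [root_id]
    while pending:
        token = fetch_token(pending.pop())
        collected.append(token)
        # Push children in reverse so they are visited in their stored order.
        pending.extend(reversed(token.get("children", [])))
    return collected


# A tiny map: the root indexes two objects, one of which has a child.
store = {
    "map": {"id": "map", "children": ["road1", "stairsA"]},
    "road1": {"id": "road1", "children": ["segment1"]},
    "segment1": {"id": "segment1"},
    "stairsA": {"id": "stairsA"},
}
order = [t["id"] for t in fetch_subtree(store.__getitem__, "map")]
print(order)
```

Every node costs a separate retrieval, which is why fetching a large portion of the map this way is cumbersome and why matching larger structures in one request is attractive.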


7.  Other  Tasks  of  the  System 

Navigation  is  just  one  goal  of  a  mobile  robot  system.  Generally 
speaking,  however,  navigation  itself  is  not  an  end,  but  actually  a 
means to achieve the final goals of the autonomous mobile robot
system,  such  as  carrying  baggage,  exploration,  or  refueling. 
Therefore,  the  system  architecture  must  be  able  to  accommodate 
tasks  other  than  navigation. 

Figure  10  illustrates  one  example  of  an  extended  system 
architecture  which  loads,  carries  and  unloads  baggage.  The  whole 
system is comprised of four layers: mission control, vehicle resource
management, signal processing, and physical hardware. The
CAPTAIN, the only module in the mission control layer, stores the
user  mission  steps,  sends  them  to  the  vehicle  resource  management 
layer  one  by  one,  and  oversees  their  execution. 


Figure 10: Extended System Architecture


In  the  vehicle  resource  management  layer,  there  are  different 
modules  working  for  different  tasks.  Although  their  tasks  are  different, 
they  all  work  in  a  symbolic  domain  and  do  not  handle  the  physical 
world  directly.  These  modules  oversee  mission  execution,  generate 
plans,  and  pass  information  to  modules  in  the  signal  processing  layer. 
Through  CODGER,  they  can  communicate  with  each  other,  if 
necessary.  The  MAP  NAVIGATOR  and  the  PILOT,  parts  of  the 
navigation  system,  are  included  in  the  vehicle  resource  management 
layer.  The  MANIPULATOR  makes  a  plan  (e.g.,  how  to  load  and 
unload  baggage  with  the  arm)  and  sends  it  to  the  ARM 
CONTROLLER. 

The modules in the signal processing layer interact with the physical
world using sensors and actuators. For example, PERCEPTION
processes the signals from sensors, the HELM drives the physical
vehicle, and the ARM CONTROLLER operates the robot arm. The
bottom layer contains the real hardware, even if it includes some
primitive controller. The sensors, the physical vehicle, and the robot
arm  are  included  in  this  layer. 

Because  our  current  system  architecture  is  built  on  the  CODGER 
system, it will be easy to expand it to include these additional
capabilities. 


8.  Conclusions 

In  this  paper,  we  have  described  the  CMU  architecture  for 
autonomous  outdoor  navigation.  The  system  is  highly  modular  and 
includes  components  for  both  global  and  local  navigation.  Global 
navigation  is  carried  out  by  a  route  planner  that  searches  a  map 
database to find the best path satisfying a mission and oversees its
execution.  Local  navigation  is  carried  out  by  modules  that  use  a  color 
camera  and  a  laser  rangefinder  to  recognize  roads  and  landmarks, 
scan  for  obstacles,  reason  about  geometry  to  plan  paths,  and  oversee 
the vehicle's execution of a planned trajectory.

The  perception,  planning,  and  control  components  are  integrated 
into  a  single  system  through  the  CODGER  software  system. 
CODGER provides a common data representation scheme for all
modules  in  the  system  with  special  attention  paid  to  geometry. 
CODGER  also  provides  primitives  for  synchronizing  the  modules  in  a 
way that maximizes parallelism at both the local and global levels.

We  have  demonstrated  our  system's  ability  to  drive  around  a 
network  of  sidewalks  and  along  a  curved  road,  recognize  complicated 
landmarks,  and  avoid  obstacles.  Future  work  will  focus  on  improving 
CODGER  for  handling  more  difficult  sensor  fusion  problems.  We  will 
also  work  on  better  schemes  for  local  navigation  and  will  strive  to 
reduce  our  dependence  on  map  data. 


9.  Acknowledgements 

The  design  of  our  architecture  was  shaped  by  contributions  from 
the  entire  Autonomous  Land  Vehicle  group  at  CMU.  We  extend 
special thanks to Steve Shafer, Chuck Thorpe, and Takeo Kanade.


10. References

[1] Durrant-Whyte, H.
Integration, Coordination and Control of Multi-Sensor Robot Systems.
PhD thesis, University of Pennsylvania, 1986.

[2] Goto, Y., Matsuzaki, K., Kweon, I., Obatake, T.
CMU Sidewalk Navigation System.
In FJCC-86. 1986.

[3] Hebert, M. and Kanade, T.
Outdoor Scene Analysis Using Range Data.
In Proc. 1986 IEEE Conference on Robotics and Automation. April, 1986.

[4] Kanade, T., Thorpe, C., and Whittaker, W.
Autonomous Land Vehicle Project at CMU.
In Proc. 1986 ACM Computer Conference. Cincinnati, February, 1986.

[5] Lozano-Perez, T., Wesley, M. A.
An Algorithm for Planning Collision-Free Paths Among Polyhedral Obstacles.
Communications of the ACM 22(10), October, 1979.

[6] Lozano-Perez, T.
Spatial Planning: A Configuration Space Approach.
IEEE Transactions on Computers C-32(2), February, 1983.

[7] Mikhail, E. M., Ackerman, F.
Observations and Least Squares.
University Press of America, 1976.

[8] Shafer, S., Stentz, A., Thorpe, C.
An Architecture for Sensor Fusion in a Mobile Robot.
In Proc. IEEE International Conference on Robotics and Automation. April, 1986.

[9] Stentz, A., Shafer, S.
Module Programmer's Guide to Local Map Builder for NAVLAB.
1986. In Preparation.

[10] Wallace, R., Stentz, A., Thorpe, C., Moravec, H., Whittaker, W., Kanade, T.
First Results in Robot Road-Following.
In Proc. IJCAI-85. August, 1985.

[11] Wallace, R., Matsuzaki, K., Goto, Y., Webb, J., Crisman, J., Kanade, T.
Progress in Robot Road Following.
In Proc. IEEE International Conference on Robotics and Automation. April, 1986.


QUALITATIVE  NAVIGATION 


Tod  S.  Levitt,  Daryl  T.  Lawton,  David  M.  Chelberg,  Philip  C.  Nelson 


Advanced  Decision  Systems,  Mountain  View,  California  94040 


ABSTRACT 

Existing robot navigation techniques such as triangulation,
dead reckoning, ranging sensors, or stereo,
depend  on  accurate  range  data,  correspondence  of  map 
data  with  the  robot’s  location,  good  estimates  of  the 
robot’s  motion,  inertial  navigation  systems,  or  upon 
local  obstacle  avoidance  techniques.  These  approaches 
tend  to  be  brittle,  accumulate  error  and  utilize  little  or 
no  perceptual  information. 

This  paper  describes  a  formal  theory  that  depends 
on  visual  landmark  recognition  for  representation  of 
environmental  locations,  and  encodes  local  perceptual 
knowledge  in  structures  called  viewframes.  Paths  in 
the world are represented as sequences of sets of landmarks,
viewframes, and other distinctive visual events.
Approximate headings are computed between
viewframes that have lines of sight to common landmarks.
Range-free, topological descriptions of place,
called orientation regions, are rigorously abstracted
from viewframes. They yield a coordinate-free model
of visual landmark memory that can also be used for
navigation and guidance. With this approach, a robot
can opportunistically observe and execute visually cued
“shortcuts”. Map and metric data are not required,
but, if available, are handled in a uniform representation
within the qualitative navigation techniques.


1.  INTRODUCTION 

The questions that define navigation and guidance
are:

•  Where  am  I? 

•  Where  are  other  places  relative  to  me? 

•  How  do  I  get  to  other  places  from  here? 


A  robot  that  moves  about  the  world  must  be  able  to 
compute answers to these questions. This paper is concerned
with the structure and processing for robotic
visual memory and inference that yields navigation and
guidance. Here, guidance means the problem of visual
path-planning. That is, the input data is assumed to
be percepts extracted from imagery, and a database,
i.e., memory, of models for visual recognition. A priori
model and map data is only relevant insofar as it provides
a basis for runtime recognition of observable
events. This is distinguished from path traversability
planning where the guidance questions concern computing
shortest distances between points under constraints
of  support  of  the  ground  or  surrounding  environment 
for  the  robotic  vehicle. 

A  multi-level  theory  of  spatial  representation 
based upon the observation and re-acquisition of distinctive
visual events (i.e., landmarks) has been
developed. The representation provides the theoretical
foundations for a visual memory database that includes
coordinate-free, topological representation of relative
spatial location, yet smoothly integrates available
metric knowledge of relative or absolute angles and distances.
Rules and algorithms are presented that, under
the assumption of correct association of landmarks on
re-acquisition (although not assuming landmarks are
necessarily re-acquired), provide a robot with navigation
and guidance capability. The visual memory constructed
while moving through the environment contains
sufficient data to deduce or update a map of the
environment.

Existing robot navigation techniques include triangulation
[Matthies and Shafer - 86], ranging sensors
[Hebert and Kanade - 86], auto-focus [Pentland - 85],
stereo techniques [Lucas and Kanade - 84], [Eastman
and Waxman - 85], dead reckoning, inertial navigation,
geo-satellite location, correspondence of map data with
the robot's location, and local obstacle avoidance techniques
[Moravec - 80]. These approaches tend to be
brittle [Bajcsy et al. - 86], accumulate error [Smith and
Cheeseman - 85], are limited by the range of an active
sensor, depend on accurate measurement of
distance, direction perceived or traveled, and are non-perceptual,
or only utilize very weak perceptual models.

Furthermore,  these  theories  are  largely  concerned 
with  the  problem  of  measurement  and  do  not  centrally 
address  issues  of  map  or  visual  memory  and  the  use  of 
this  memory  for  inference  in  vision-based  navigation 
and  guidance.  Exceptions  to  this  are  the  work  of 
[Davis - 86], [McDermott and Davis - 84], and [Kuipers
- 77]. Davis addressed the problem of representation
and assimilation of 2D geometric memory, but assumed
an orthographic view of the world and did not consider
navigation or guidance. McDermott and Davis
developed an ad hoc mixture of vector and topological
based route planning, but assumed a map, rather than
vision derived world (in their assumptions of knowledge
of boundaries, their shapes, and spatial relationships),
had no formal theory relating the multiple levels of
representation, and consequently did not derive or
implement results about path execution. Kuipers
developed qualitative techniques for navigation and guidance.
He assumed capability of landmark recognition,
as we do, but relied on dead-reckoning and constraint
to one-dimensional (road) networks to permit path
planning and execution.

In  this  paper,  we  develop  representation  and 
inference  for  relative  geographic  position  information 
that: 

•  builds a memory of the environment the robot
passes through,

•  contains  sufficient  information  to  allow  the 
robot  to  re-trace  its  paths, 

•  can  be  used  to  construct  or  update  an  a  pos¬ 
teriori  map  of  the  geographic  area  the  robot 
has  passed  through,  and 

•  can  utilize  all  available  information,  including 
that  from  runtime  perceptual  inferences  and  a 
priori  map  data,  to  perform  navigation  and  gui¬ 
dance. 


To accomplish this, the notion of a geographic "place"
is  defined  in  terms  of  data  about  visible  landmarks.  A 
place,  as  a  point  on  the  surface  of  the  ground,  is 
defined  by  the  landmarks  and  spatial  relationships 
between  landmarks  that  can  be  observed  from  a  fixed 
location.  More  generally  we  can  define  a  place  as  a 
region  in  space,  in  which  a  fixed  set  of  landmarks  can 
be  observed  from  anywhere  in  the  region,  and  relation¬ 
ships  between  them  do  not  change  in  some  appropriate 
qualitative sense. Data about places is stored in
structures called viewframes, boundaries and orientation
regions.

Viewframes  provide  a  definition  of  place  in  terms 
of  relative  angles  and  angular  error  between  landmarks, 
and  very  coarse  estimates  of  the  absolute  range  of  the 
landmarks  from  our  point  of  observation.  Boundaries 
and  orientation  regions  provide  a  more  qualitative 
definition  of  place.  Both  concepts  allow  us  to  localize 
ourselves  in  space  relative  to  a  set  of  observed  land¬ 
marks,  without  necessarily  using  a  priori  map  data. 


Viewframes  allow  us  to  localize  our  position  in 
space  relative  to  observable  local  landmark  coordinate 
systems.  In  performing  a  viewframe  localization,  we 
can  make  use  of  observed  or  inferred  data  about  our 
approximate  range  to  landmarks.  Errors  in  ranging 
and  relative  angular  separation  between  landmarks  are 
smoothly  accounted  for.  A  priori  map  data  can  also  be 
incorporated. 

If  we  drop  all  range  information,  we  can  still  use 
the  notion  of  boundaries  to  determine  our  qualitative 
position  relative  to  other  landmarks.  A  pair  of  land¬ 
marks  creates  a  virtual  division  of  the  ground  surface 
by  the  line  connecting  the  two  landmarks.  The  observ¬ 
able  relative  orientation  of  the  landmarks,  i.e.,  the  left- 
to-right  order  of  the  pair  of  landmarks,  indicates  which 
side of the landmark-pair-boundary (LPB) we are on.
A set of LPB's with orientations determines a region on
the ground called an orientation region. LPB's can be
derived by considering pairs of landmarks from
viewframes; in this case we speak of the set of orientation
regions induced by, or associated with, the viewframe.

As  a  robot  moves  over  the  ground  surface,  the 
crossing  of  LPB's  indicates  passage  from  one  orienta¬ 
tion  region  to  another.  Thus,  orientation  regions  are 
bounded  by  the  unique  observable  visual  events  of 
passing  to  the  right  of,  left  of,  or  between  a  pair  of 
landmarks. In this manner, LPB's and orientation
regions  yield  a  natural  notion  of  headings  and  paths  in 
the  environment. 

Paths  in  the  world  are  represented  as  sequences  of 
observations of landmarks, viewframes, LPB's and
other  distinctive  visual  events.  We  compute  approxi¬ 
mate  headings  between  viewframes  that  have  lines  of 
sight  to  common  landmarks.  This  allows  a  robot  to 
opportunistically observe visually cued "shortcuts".

We  define  headings  as  world  states  that  can  be 
created  by  a  robotic  action,  namely  that  of  following 
the  heading  direction  specifier,  and  whose  negation,  i.e., 
failure  to  maintain  the  heading  specifier  while  in 
motion,  can  either  be  created  by  robotic  actions  or  per¬ 
ceptually  observed  in  environmental  events.  We  make 
computational  the  concept  of  maintaining  a  heading 
until  an  observable  or  inferable  termination  condition  is 
met. While a heading is being executed, the vision
system builds up a visual representation of the
environment it is passing through. Termination conditions
correspond to observable or computable changes in our
location in space, triggering additional spatial inference
processes if we have not reached our destination goal.

Figure 2-1: Sensor Centered Coordinate System

The  robust,  qualitative  properties  and  formal 
mathematical  basis  of  this  representation  and  inference 
processes  are  suggestive  of  the  navigation  and  guidance 
behavior in animals and humans [Schone - 84]. However,
we make no claims of biological foundations for
this  approach. 

In  the  following,  we  first  develop  the  mathemati¬ 
cal  theory  of  viewframes,  boundaries  and  orientation 
regions.  We  then  show  how  these  qualitative, 
topological  concepts  interact  with  a  priori  metric  map 
data.  Perceptual  definitions  of  landmarks  and  require¬ 
ments  for  vision  system  performance  in  re-acquisition  of 
landmarks  to  support  the  proposed  perception-based 
navigation approach are reviewed in Section 3.
Inference for path planning over viewframe, boundary and
orientation region representations is presented in
Section 4, and results are presented in Section 5.


2.  TOPOLOGICAL  LANDMARK  NETWORK 
REPRESENTATIONS 


2.1  VIEWFRAMES 

A  viewframe  encodes  the  observable  landmark 
information  in  a  stationary  panorama.  That  is,  we 
assume  that  the  sensor  platform  is  stationary  long 
enough for the sensor to pan up to 360 degrees, to tilt
up to 90 degrees (or to use an omni-directional sensor
[Cao et al. - 86]), to recognize landmarks in its field of
view,  or  to  buffer  imagery  and  recognize  landmarks 
while  in  motion. 

To  generate  a  viewframe,  relative  solid  angles 
between  distinguished  points  on  landmarks  are  com¬ 
puted  using  a  sensor-centered  coordinate  system.  A 
distinguished  point  on  a  landmark  may  be  a  statisti¬ 
cally  derived  point  such  as  the  centroid  of  the  projec¬ 
tion  of  the  landmark  in  an  image,  a  structurally 
derived point such as a vertex or high curvature
boundary point, or perceptually derived as in a unique
visual feature of the landmark. Representation of
landmarks and choice of distinguished points are presented in
greater detail in Section 3.

A  sensor-centered  spherical  coordinate  system  is 
established.  It  fixes  an  orientation  in  azimuth  and 
elevation,  and  takes  the  direction  opposite  the  current 
heading  as  the  zero  degree  axis.  Then  two  landmarks 
in  front  of  us,  relative  to  our  heading,  will  have  an 
azimuth  separation  of  less  than  180  degrees.  If  we 
assume that no two distinguished landmark points have
the same azimuth coordinate (i.e., no two
distinguished points appear one directly above the other)
then we obtain a well-ordering of the landmarks in the
azimuth direction, which we can speak of as "ordered
from left to right". The relative solid angle between
two  distinguished  landmark  points  is  now  well  defined; 
see  Figure  2-1. 


Under the above assumptions, we can pan from
left to right, recognizing landmarks, $L_i$, and storing the
solid angles between landmarks in order, denoting the
angle between the i-th and j-th landmarks by $Ang_{ij}$.
The basic viewframe data are these two ordered lists,
$(L_1, L_2, \ldots)$ and $(Ang_{12}, Ang_{23}, \ldots)$. The relative angular
displacement between any two landmarks can be
computed from this basic list. However, the angle between
landmarks that is more readily measured is the planar
angle, $\theta_{ij}$, established by the sensor focal point and the
two distinguished points observed in the i-th and j-th
landmarks. There is an error in computing these
relative angles that is at least as great as the resolution of the
vision system, and may include cumulative pan/tilt
error, angular ambiguity in landmark point localization,
or other error sources. The angular error is measured
by $e_{ij}$ between landmarks i and j. Finally, range
estimates are required for landmarks recorded in
viewframes. These estimates can be arbitrarily coarse,
but finite. We only require that the true range lie
between the bounds specified for the estimate. We
denote the range interval associated to landmark $L_i$ by
$[r_{i1}, r_{i2}]$. We now explain how it is possible to localize
ourselves in space relative to these observed landmarks.
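
The viewframe data just described can be sketched as a small record. This is only an illustration under stated assumptions: the class name `Viewframe`, its field names, and the choice to store errors only for adjacent landmark pairs and to add them up as a worst-case bound are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Viewframe:
    """Hypothetical container for the basic viewframe data."""
    landmarks: List[str]               # L_1, L_2, ... in left-to-right order
    angles: List[float]                # Ang_12, Ang_23, ... in radians
    angle_errors: List[float]          # e_12, e_23, ... in radians (assumed per adjacent pair)
    ranges: List[Tuple[float, float]]  # [r_i1, r_i2] for each landmark

    def angle_between(self, i: int, j: int) -> Tuple[float, float]:
        """Angle from landmark i to landmark j (0-based, i < j).

        The displacement is the sum of the stored adjacent angles; the
        returned error is the sum of adjacent errors, an assumed
        worst-case bound.
        """
        if not 0 <= i < j < len(self.landmarks):
            raise ValueError("need 0 <= i < j < number of landmarks")
        return sum(self.angles[i:j]), sum(self.angle_errors[i:j])


vf = Viewframe(
    landmarks=["tower", "tree", "rock"],
    angles=[0.5, 0.25],
    angle_errors=[0.01, 0.02],
    ranges=[(50.0, 200.0), (20.0, 80.0), (100.0, 1000.0)],
)
angle_13, error_13 = vf.angle_between(0, 2)
```

The range intervals are kept deliberately coarse in the example, since the text only requires that the true range lie inside them.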

Figure 2-2: Constant Angle Toroid

We begin by noting that the set of points in
3-space from which we can observe an angle of $\theta_{ij}$
between landmarks $L_i$ and $L_j$ is constrained to a
closed torus-like surface; a cut-away of this surface is
pictured in Figure 2-2. This is more easily observed in
a planar cross-section, where the shape is the figure
eight cross-section in Figure 2-3. To prove this, we set
up a polar coordinate system with the origin at $L_i$, and
with $L_j$ at coordinates $[s_{ij}, 0]$, where $s_{ij}$ is the fixed,
but unknown, distance between the landmarks. Denote
the (unknown) distance from the sensor to $L_i$ by $r_i$
and the distance to $L_j$ by $r_j$. We now ask, what is
the set of points in the plane, $[r_i, \phi]$, from which we
can observe an angle of $\theta_{ij}$ between $L_i$ and $L_j$? This
situation is pictured in Figure 2-3. We can now compute:

$$r_i = s_{ij} \cos\phi + s_{ij} \cot\theta_{ij} \sin\phi$$

$$\phi = \cot^{-1}\left( \frac{r_i}{r_j} \csc\theta_{ij} - \cot\theta_{ij} \right)$$

The first equation is the polar form for a circle with
polar center $\left[ \frac{s_{ij}}{2} \csc\theta_{ij},\ \frac{\pi}{2} - \theta_{ij} \right]$
and radius $\frac{s_{ij}}{2} \csc\theta_{ij}$. It
has singularities where $r_i$ or $r_j$ are equal to zero. By
symmetry we obtain the figure eight like shape pictured
in Figure 2-3. Rotating the circular arc in 3-space
about the axis defined by the line segment joining $L_i$
and $L_j$, we obtain the figure pictured in Figure 2-2.
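
The constant-angle circle can be checked numerically. The sketch below places $L_i$ at the origin and $L_j$ at $[s_{ij}, 0]$, generates sensor positions from the polar equation for the range to $L_i$, and measures the angle actually subtended at each position; it also recovers the polar angle from the ratio of the two ranges. The function names and the sample values are assumptions made for illustration.

```python
import math


def range_on_constant_angle_circle(s_ij, theta, phi):
    """Range to L_i: r_i = s_ij*cos(phi) + s_ij*cot(theta)*sin(phi)."""
    return s_ij * math.cos(phi) + s_ij * math.sin(phi) / math.tan(theta)


def subtended_angle(sensor, a, b):
    """Planar angle at `sensor` between the directions to points a and b."""
    ax, ay = a[0] - sensor[0], a[1] - sensor[1]
    bx, by = b[0] - sensor[0], b[1] - sensor[1]
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.acos(max(-1.0, min(1.0, cos_t)))


s_ij = 10.0                      # landmark separation (illustrative value)
theta = math.radians(40.0)       # observed angle (illustrative value)
L_i, L_j = (0.0, 0.0), (s_ij, 0.0)

observed = []    # angle subtended at each generated sensor position
recovered = []   # (phi, phi recomputed from the range ratio)
for phi_deg in (20, 60, 100, 130):          # chosen so that phi + theta < 180 degrees
    phi = math.radians(phi_deg)
    r_i = range_on_constant_angle_circle(s_ij, theta, phi)
    sensor = (r_i * math.cos(phi), r_i * math.sin(phi))
    r_j = math.hypot(sensor[0] - L_j[0], sensor[1] - L_j[1])
    observed.append(subtended_angle(sensor, L_i, L_j))
    # cot(phi) = (r_i / r_j) * csc(theta) - cot(theta); atan2 keeps phi in (0, pi)
    cot_phi = (r_i / r_j) / math.sin(theta) - 1.0 / math.tan(theta)
    recovered.append((phi, math.atan2(1.0, cot_phi)))
```

Every entry of `observed` should equal `theta` up to rounding, which is the inscribed-angle property the derivation relies on.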


Figure  2-3:  Constant  Angle  Circular  Arcs 

Figure  2-3  shows  how  varying  the  sensor  location 
along  the  circular  arc  is  equivalent  to  varying  the  abso¬ 
lute  ranges  to  the  two  landmarks.  If  we  can  bound  the 
ranges  to  the  landmarks,  then  we  can  localize  ourselves 
along  the  circular  arc  accordingly.  This  is  logically 
equivalent  to  establishing  a  local  coordinate  frame 
between  the  landmarks,  and  bounding  our  location 
relative  to  that  frame.  If  we  can  register  the  landmarks 
with  a  priori  map  data,  then  we  can  know  and  use  the 
distance  between  the  landmarks,  but  this  is  not  neces¬ 
sary. 


Error in angular measurement of the $\theta_{ij}$
corresponds to different choices of concentric circular
arcs each of which contains $L_i$, $L_j$ and the sensor focal
point. If we union these arcs together, we obtain the
localization, $Loc_{ij}$, of the sensor relative to the two
landmarks. Because our localization must be true
relative to all observed landmark pairs simultaneously, the
intersection of the $Loc_{ij}$ over all $i > j$ gives our best
localization. Such a localization is pictured in Figure
5-2. Intersecting the $Loc_{ij}$ requires that the $Loc_{ij}$ be
represented in the same local coordinate frames. We
discuss transformations between local landmark
coordinate frames in Section 4. These concepts can clearly be
extended to 3-space reasoning, if necessary.
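
One way to picture the intersection of the pairwise localizations is as a point-membership test: a candidate position belongs to the localization if it satisfies every pairwise angle constraint and every range interval. The sketch below assumes, for illustration only, that all landmarks are already expressed in one common planar frame, which sidesteps the frame transformations deferred to Section 4. The names `in_localization`, `pair_constraints`, and `range_intervals`, and the sample coordinates, are hypothetical.

```python
import math
from itertools import combinations


def subtended_angle(p, a, b):
    """Planar angle at p between the directions to a and b."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.acos(max(-1.0, min(1.0, cos_t)))


def in_localization(p, landmarks, pair_constraints, range_intervals):
    """True if p satisfies every angle constraint and every range interval."""
    for (i, j), (theta, err) in pair_constraints.items():
        if abs(subtended_angle(p, landmarks[i], landmarks[j]) - theta) > err:
            return False
    for i, (r_min, r_max) in range_intervals.items():
        r = math.hypot(p[0] - landmarks[i][0], p[1] - landmarks[i][1])
        if not r_min <= r <= r_max:
            return False
    return True


# Simulated observation: landmark coordinates and the true sensor position
# are made up so that the constraints can be generated consistently.
landmarks = {1: (0.0, 0.0), 2: (10.0, 0.0), 3: (5.0, 8.0)}
true_sensor = (4.0, -6.0)
angle_error = math.radians(2.0)

pair_constraints = {
    (i, j): (subtended_angle(true_sensor, landmarks[i], landmarks[j]), angle_error)
    for i, j in combinations(sorted(landmarks), 2)
}
range_intervals = {}
for i, (x, y) in landmarks.items():
    r_true = math.hypot(true_sensor[0] - x, true_sensor[1] - y)
    range_intervals[i] = (0.5 * r_true, 2.0 * r_true)   # deliberately coarse

inside = in_localization(true_sensor, landmarks, pair_constraints, range_intervals)
outside = in_localization((40.0, -60.0), landmarks, pair_constraints, range_intervals)
```

Because the measured angle is unsigned, the mirror image of the true position across a landmark pair also satisfies that pair's angle constraint, which matches the figure eight shape of the cross-section.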


2.2  BOUNDARIES  AND  ORIENTATION 
REGIONS 

Viewframes  contain  two  basic  dimensions  of  data: 
the  relative  angles  between  landmarks,  and  the 
estimated  range  (intervals)  to  the  landmarks.  If  we 
drop  the  range  information,  we  are  left  with  purely 
topological  data.  That  is,  it  is  impossible,  using  only 
the  relative  angles  between  landmarks,  and  no  range, 
map  or  other  metric  data,  to  determine  the  relative 
angles  between  triples  of  landmarks,  or  to  construct 
parametric  representations  of  our  location  with  respect 
to  the  landmarks.  Nonetheless,  there  is  topological 
localization  information  present  in  the  ordinal  sequence 
of  landmarks;  there  is  a  sense  in  which  we  can  compute 
differences  between  geographic  regions,  and  observe 
which  region  we  are  in. 

The  basic  concept  is  to  note  that  if  we  draw  a 
line  between  two  (point)  landmarks,  and  project  that 
line  onto  the  (possibly  not  flat)  surface  of  the  ground, 
then  this  line  divides  the  earth  into  two  distinct 
regions.  If  we  can  observe  the  landmarks,  we  can 
observe  which  side  of  this  line  we  are  on.  The  “virtual 
boundary”  created  by  associating  two  observable  land¬ 
marks  together  thus  divides  space  over  the  region  in 
which  both  landmarks  are  visible.  We  call  these 
landmark-pair-boundaries  (LPB’s),  and  denote  the  LPB 
constructed from the landmarks $L_1$ and $L_2$ by
$\mathrm{LPB}(L_1, L_2)$.

Roughly speaking, if we observe that landmark
$L_1$ is on our left hand, and landmark $L_2$ is on our
right, and the angle from $L_1$ to $L_2$ (left to right) is less
than 180 degrees, then we denote this side of, or
equivalently, this orientation of, the LPB by $[L_1\,L_2]$.
If we stand on the other side of the boundary,
$\mathrm{LPB}(L_1, L_2)$, "facing" the boundary, then $L_2$ will be on
our left hand and $L_1$ on our right and the angle
between them less than 180 degrees, and we can denote
this orientation or side as $[L_2\,L_1]$ (left to right).

More rigorously, define:

$$\text{orientation-of-LPB}(L_1, L_2) = \operatorname{sign}(\pi - \theta_{12}) = \begin{cases} +1 & \text{if } \theta_{12} < \pi \\ 0 & \text{if } \theta_{12} = \pi \\ -1 & \text{if } \theta_{12} > \pi \end{cases}$$

where $\theta_{12}$ is the relative azimuth angle between $L_1$
and $L_2$ measured in an arbitrary sensor-centered
coordinate system. Here, an orientation of +1 corresponds
to the $[L_1\,L_2]$ side of $\mathrm{LPB}(L_1, L_2)$, -1 corresponds to
the $[L_2\,L_1]$ side of $\mathrm{LPB}(L_1, L_2)$ and 0 corresponds to
being on $\mathrm{LPB}(L_1, L_2)$. It is straightforward to show
that this definition of LPB orientation does not depend
on the choice of sensor-centered coordinate system.
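
The orientation definition can be exercised with a short sketch. It assumes azimuth increases from left to right (clockwise when seen from above) and takes the relative azimuth angle as the left-to-right angle from the first landmark to the second, reduced to the interval from 0 to 2π. The helper names and the `zero_axis` parameter, which stands for an arbitrary choice of zero direction, are hypothetical.

```python
import math


def azimuth(sensor, landmark, zero_axis=0.0):
    """Azimuth of `landmark` seen from `sensor`, in [0, 2*pi).

    math.atan2 increases counter-clockwise, so it is negated to make
    azimuth increase from left to right. `zero_axis` rotates the zero
    direction of the sensor-centered frame.
    """
    dx = landmark[0] - sensor[0]
    dy = landmark[1] - sensor[1]
    return (-math.atan2(dy, dx) - zero_axis) % (2.0 * math.pi)


def lpb_orientation(az_1, az_2, tol=1e-9):
    """sign(pi - theta_12), with theta_12 the left-to-right angle from L_1 to L_2."""
    theta_12 = (az_2 - az_1) % (2.0 * math.pi)
    gap = math.pi - theta_12
    if abs(gap) <= tol:
        return 0
    return 1 if gap > 0 else -1


L1, L2 = (0.0, 0.0), (10.0, 0.0)   # illustrative landmark positions


def orientation_from(sensor, zero_axis=0.0):
    """Orientation of LPB(L1, L2) as observed from `sensor`."""
    return lpb_orientation(azimuth(sensor, L1, zero_axis),
                           azimuth(sensor, L2, zero_axis))
```

Calling `orientation_from` with different `zero_axis` values from the same position gives the same sign, because a common offset cancels in the azimuth difference. This is the coordinate-independence property stated in the text.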

LPB’s  give  rise  to  a  topological  division  of  the 
ground  surface  into  observable  regions  of  localization, 
called  orientation  regions.  Crossing  boundaries 
between  orientation  regions  leads  to  a  qualitative  sense 
of  path  planning  based  on  perceptual  information. 
Inference  for  perceptual  path  planning  is  explored  in 
Section  4.2. 

Figure  2-4(a)  shows  the  LPB's  that  are  implicit  in 
a  viewframe.  The  solid  lines  are  the  virtual  boundaries 
created  by  landmark  pairs.  Figure  2-4(a)  can  be  mis¬ 
leading  in  that  it  seems  to  imply  that  the  LPB's  con¬ 
tain  the  data  to  compute  the  angle-distance  geometry 
of  the  sensor  location  relative  to  the  landmarks.  Figure 
2-4(b)  shows  a  representation  of  the  same  viewframe. 
Here  the  ranges  to  the  landmarks  have  been  changed, 
but  the  ordinal  angular  relationships  between  land¬ 
marks  have  not.  The  angle-distance  geometry  of  our 
apparent  location  relative  to  the  landmarks  is  com¬ 
pletely  different;  however,  the  topological  information, 
that is, the number of regions, their number of sides
and adjacency relations, is preserved.

More  specifically,  there  is  topological  information 
captured  in  the  orientation  region  representation  that 
distinguishes regions, yet is invariant under large
motions  of  the  observing  sensor.  The  LPB’s  divide  the 
ground  surface  into  regions.  If  we  regard  the  boun¬ 
daries  of  regions  as  fattened  wireframes,  then  the  orien¬ 
tation  regions  may  be  thought  of  as  (topological)  holes 
in  the  surface.  If  we  view  the  surface  as  being  the 
whole  earth,  then  the  shape  formed  by  cutting  the 
orientation  regions  out  of  the  surface  of  the  (hollow) 
earth,  is  a  two  dimensional  manifold  with  boundary;  in 
particular,  the  shape  is  topologically  equivalent  to  a 
(two)  sphere  with  finitely  many  holes  cut  out  of  it. 
The  number  of  holes  is  the  number  of  orientation 
regions.  From  topology  we  have  that  two  orientable 
two  dimensional  manifolds  with  boundary,  are  topologi¬ 
cally  equivalent  if  and  only  if  they  have  the  same  genus 
and the same number of holes (i.e., boundary
components) [Massey - 67]. In this case, the genus of a
sphere  with  holes  in  it  is  zero,  so  two  shapes  induced  by 
orientation  regions  are  topologically  equivalent  if  and 
only  if  they  have  the  same  number  of  orientation 
regions. 

Notice  that  orientation  regions  are  equivalence 
classes  of  viewframes,  where  two  viewframes  are 
equivalent if the azimuth angular order of visible
landmarks  is  the  same.  There  is  a  strong  relationship 
between  the  number  of  orientation  regions  the  LPB’s 
divide  space  into,  and  the  angular  order,  left  to  right  in 
azimuth,  observed  among  the  landmarks  in  the 
viewframe. 


Figure 2-4: Viewframe Orientation Regions

To see this, begin with a single LPB. This clearly
divides space into two regions. Reasoning inductively,
suppose we have N LPB's already established. If we
introduce another LPB, each time it crosses an existing
LPB, it divides the last region it was crossing through
into two, thus generating exactly one more region. If it
crosses through a boundary at a point that is already
the crossing point for two or more other LPB's, then,
although it is technically crossing two (or more) LPB's
simultaneously, it generates only one more region.
Finally note that after the LPB has no more other
LPB's to cross (which will happen because it can only
cross each LPB at most once), it continues on "to
infinity", creating exactly one more (unbounded)
region. Thus, the total number of regions created by N
LPB's is equal to:

number-orientation-regions = N + (number-of-crossings-with-multiplicities) + 1

Here the multiplicity of a crossing is defined as 1, if two
LPB's cross, and, in general, as (number-of-LPBs-crossing - 1)
for two or more LPB's.
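
The region count can be checked on small configurations. The sketch below treats each LPB as an infinite straight line a*x + b*y = c in a flat plane, which is a simplifying assumption, and groups coincident crossing points by rounding their coordinates. The function name `count_regions` and the line encoding are hypothetical.

```python
from itertools import combinations


def count_regions(lines, tol=1e-9):
    """Regions cut out by distinct lines given as (a, b, c) with a*x + b*y = c.

    Implements N + (crossings counted with multiplicity) + 1, where a
    crossing point shared by k lines has multiplicity k - 1.
    """
    crossings = {}   # rounded crossing point -> indices of the lines through it
    for (i, (a1, b1, c1)), (j, (a2, b2, c2)) in combinations(enumerate(lines), 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue   # parallel lines never cross
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        key = (round(x, 6), round(y, 6))
        crossings.setdefault(key, set()).update((i, j))
    multiplicity_sum = sum(len(indices) - 1 for indices in crossings.values())
    return len(lines) + multiplicity_sum + 1


general_position = [(1, 0, 0), (0, 1, 0), (1, 1, 1)]   # x = 0, y = 0, x + y = 1
concurrent = [(1, 0, 0), (0, 1, 0), (1, -1, 0)]        # all three through the origin
parallel_pair = [(1, 0, 0), (1, 0, 1)]                 # x = 0 and x = 1
```

Three lines in general position give seven regions, while three lines through a common point give six, which illustrates why a crossing of higher multiplicity contributes fewer regions than the same number of separate crossings.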




If a viewframe contains the landmark sequence
$L_1, L_2, L_3, L_4$, then $\mathrm{LPB}(L_1, L_3)$ and $\mathrm{LPB}(L_2, L_4)$ must
cross. It is possible for $\mathrm{LPB}(L_1, L_2)$ and $\mathrm{LPB}(L_3, L_4)$ to
be parallel, and crossing points of greater than
multiplicity 1 can happen; however, these are relatively rare
events. It follows that a reasonable estimate of the
number of orientation regions implicit in a viewframe
can be gotten by assuming that all LPB crossings
generated by pairs of landmarks in the viewframe have
multiplicity 1 (except for the crossings through
landmarks themselves, which have multiplicity K-1 for K
landmarks), and that no LPB's are parallel. Under
these assumptions, the number of orientation regions
determined by a viewframe only depends on the
number of landmarks in the viewframe.

Notice  that  crossings  of  multiplicity  1  are  stable 
in  that  small  perturbations  of  the  locations  of  the  four 
landmarks  that  generate  the  two  crossing  LPB’s  cannot 
change the multiplicity of the crossing. However,
crossings of multiplicity greater than 1 are inherently
singular (up to the vision system's ability to resolve
landmark points). Furthermore, the crossing of two LPB's
is  also  stable,  while  the  event  of  two  LPB’s  being  paral¬ 
lel  is  a  singular  event.  Interestingly,  the  parallelism  of 
two  LPB’s  is  not  observable  without  relative  range 
information. 

Now if we observe a viewframe and ignore the
range information, we can still observe all LPB's
created by taking landmarks in pairs. The conjunction
of  the  orientations  defines  the  boundaries  of  the  orien¬ 
tation  region  we  currently  occupy.  Any  conjunction  of 
LPB  orientations  that  shares  at  least  one  LPB,  but  has 
the  orientation  reversed,  must  be  in  a  different  region 
of  space. 
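
The conjunction of LPB orientations can be represented as a signature: one orientation sign per landmark pair. In the sketch below, landmark positions are simulated in a common plane so that signatures can be generated from several viewpoints; a real system would obtain the azimuths from the sensor. The function names and coordinates are assumptions for illustration.

```python
import math
from itertools import combinations


def azimuth(sensor, landmark):
    """Azimuth increasing left to right (clockwise), in [0, 2*pi)."""
    dx = landmark[0] - sensor[0]
    dy = landmark[1] - sensor[1]
    return (-math.atan2(dy, dx)) % (2.0 * math.pi)


def lpb_orientation(az_a, az_b, tol=1e-9):
    """sign(pi - theta_ab) for the left-to-right angle theta_ab."""
    gap = math.pi - (az_b - az_a) % (2.0 * math.pi)
    if abs(gap) <= tol:
        return 0
    return 1 if gap > 0 else -1


def orientation_signature(sensor, landmarks):
    """Map each landmark pair to the orientation of its LPB seen from `sensor`."""
    return {
        (a, b): lpb_orientation(azimuth(sensor, landmarks[a]),
                                azimuth(sensor, landmarks[b]))
        for a, b in combinations(sorted(landmarks), 2)
    }


def in_different_regions(sig_1, sig_2):
    """True if some shared LPB appears with reversed orientation."""
    return any(sig_1[k] * sig_2[k] == -1 for k in sig_1.keys() & sig_2.keys())


landmarks = {"L1": (0.0, 0.0), "L2": (10.0, 0.0), "L3": (5.0, 8.0)}
sig_a = orientation_signature((5.0, -5.0), landmarks)   # south of LPB(L1, L2)
sig_b = orientation_signature((5.5, -5.5), landmarks)   # nearby, no LPB crossed
sig_c = orientation_signature((5.0, 3.0), landmarks)    # inside the landmark triangle
```

Comparing only the keys that two signatures share mirrors the statement that one reversed LPB is enough to conclude the two viewpoints lie in different regions.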

We  call  the  set  of  orientation  regions  created  by  a 
set  of  landmarks,  the  orientation-net  associated  to  the 
set  of  landmarks.  If  we  drop  one  of  the  LPB’s  in  the 
set defining an orientation-net, we still have a
localization of space defined by the remaining orientation
regions. The previous localization must be a refinement
of the current one. Thus, the partially ordered set of
subsets of LPB's induces a partial order on refinements
and  coarsenings  of  the  orientation  region  nets  on  the 
ground. 

We  can  conclude  from  this  analysis  that  if  we  are 
at a "stable viewpoint" relative to the set of landmarks
recorded in a viewframe, then the angular order of the
landmarks completely determines the orientation-region
we are in, within the orientation-net determined by the
landmarks  in  the  viewframe.  Thus  we  see  that  if  we 
assume  only  stable  crossings  of  LPB’s  then  the  multi- 
resolution  (i.e.,  partially  ordered)  set  of  orientation-nets 
derived  from  a  viewframe  induces  a  unique  topological 
representation  of  the  visible  space.  A  subset  of  the 
total  observable  set  of  LPB’s  can  be  used  for  spatial 
inference.  This  corresponds  to  moving  to  a  coarser  level 
of  orientation-net  in  which  several  finer  regions  may 
become  indistinguishable. 


2.3  RELATIONSHIP  TO  METRIC 
REPRESENTATIONS 

We  define  a  metric  representation  to  be  one  that 
contains  enough  data  to  compute  exact  (although  not 
necessarily  correct)  geometric  angle-distance  relation¬ 
ships  between  the  represented  points.  For  example, 
map  data  is  metric,  environmental  representations 
obtained from ranging sensors are metric, a list of UTM
coordinates  of  bridge  locations  is  metric,  and  an  exact 
angle  distance  graph  of  locally  maximum  elevation 
points  extracted  from  a  government  terrain  grid  data¬ 
base  is  a  metric  representation. 

There  are  numerous  ways  to  integrate  metric 
information  with  the  more  qualitative  viewframe,  boun¬ 
dary  and  orientation  region  representations  to  augment 
the  ability  of  a  land-based  robot  to  navigate,  to  guide 
itself  about  the  world,  and  to  cue  addition  and 
refinement  of  map  data  with  visually  acquired 
knowledge.  Here  we  present  several  methods  that 
make  use  of  the  metric  data  while  retaining  the  robust¬ 
ness  of  the  more  qualitative  representations.  These 
include  using  metric  data  for: 

•  refinement  of  range  estimates  in  viewframes 

•  prediction  of  appearance  and  location  of  new 
(i.e.,  currently  occluded  or  as  yet  unobserved) 
landmarks 

•  detection  of  conflict  between  predicted  land¬ 
mark  locations  and  LPB  crossings,  and 

•  computing  headings. 

The first three are presented below. The last is
explained in Section 4.

If we can correspond landmarks observed in a
viewframe with topographic map or grid database
locations, then we know the distance between the observed
landmarks. Knowing the distance between landmarks
allows us to refine our estimates of range intervals to
observed landmarks as follows. The constraint that the
distance between the landmarks is a fixed distance, $s_{12}$,
can be represented by the law of cosines applied to the
triangle formed by the sensor focal point and the two
landmarks, as in Figure 2-3. If $r_1$ is a range estimate
to $L_1$ and $r_2$ is a range estimate to $L_2$, then we have

$$s_{12}^2 = r_1^2 + r_2^2 - 2 r_1 r_2 \cos\theta_{12}$$

Let $(r_{11}, r_{12})$ be the range interval for $L_1$ and $(r_{21}, r_{22})$
be the range interval for landmark $L_2$. Then fixing $r_1$
at $r_{11}$ and solving for $r_2$ yields a new lower bound for
the range interval for $L_2$, and fixing $r_1$ at $r_{12}$ yields a
new upper bound for the interval. (If there are two
possible values for $r_2$, take the minimum or maximum
respectively.) If either of the new bounds is worse
than the original estimates, then the information may
be used to refine the bounds on the range interval for
$L_1$ instead of $L_2$. However, if we order the landmark
pairs from smallest to largest range intervals, then no
back-tracking will be necessary, and an algorithm for
refining the intervals in a viewframe can run in
linear time in the number of landmark pairs.
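
The endpoint rule for refining a range interval can be sketched directly from the law of cosines. The sketch takes the rule at face value, so it implicitly assumes that the range to the second landmark grows with the range to the first over the interval considered; the text does not state this condition, and the example values are chosen so that it holds. The function names `solve_r2` and `refine_interval` are hypothetical.

```python
import math


def solve_r2(r1, s12, theta12):
    """Positive roots r2 of s12^2 = r1^2 + r2^2 - 2*r1*r2*cos(theta12)."""
    disc = s12 ** 2 - (r1 * math.sin(theta12)) ** 2
    if disc < 0:
        return []   # no triangle with these values exists
    root = math.sqrt(disc)
    base = r1 * math.cos(theta12)
    return sorted({r for r in (base - root, base + root) if r > 0})


def refine_interval(r1_interval, r2_interval, s12, theta12):
    """Tighten the range interval of L2 using the interval of L1.

    Fixing r1 at its lower bound gives a candidate lower bound for r2
    (smallest root); fixing r1 at its upper bound gives a candidate upper
    bound (largest root). A candidate is used only if it improves on the
    existing bound.
    """
    r11, r12 = r1_interval
    lower, upper = r2_interval
    lower_roots = solve_r2(r11, s12, theta12)
    upper_roots = solve_r2(r12, s12, theta12)
    if lower_roots:
        lower = max(lower, min(lower_roots))
    if upper_roots:
        upper = min(upper, max(upper_roots))
    return lower, upper


theta12 = math.radians(30.0)   # illustrative observed angle
s12 = 5.0                      # illustrative map distance between the landmarks
new_lower, new_upper = refine_interval((2.0, 4.0), (1.0, 20.0), s12, theta12)
```

With these values the coarse interval (1, 20) for the second landmark shrinks to roughly (6.63, 8.05).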




If  constraints  on  viewframe  ranges  from  metric 
data result in a "negative" interval (i.e., maximum
range  being  less  than  the  minimum)  this  indicates  that 
either  the  true  range  value  to  a  landmark  does  not  lie 
in  the  estimated  range  interval  or  that  the  metric  data 
is  incorrect.  This  conflict  can  be  used  to  cue  alterna¬ 
tive  inference  strategies  in  path  planning,  such  as  only 
relying  on  coherent  data  in  the  path  planning. 


Figure  2-5:  Multiple  Levels  of  Spatial  Representation 


To  use  metric  data  for  prediction  of  landmarks  in 
imagery,  we  begin  by  mapping  the  viewframe- 
localization  regions  into  a  digitized  terrain  map.  This 
is  accomplished  by  computing  the  landmark  pair  circles 
(see  Figure  2-3)  and  representing  them  as  a  bit  plane 
overlay  on  the  terrain  grid.  Because  the  landmarks 
have  fixed  locations  on  the  grid,  we  can  AND  the  bit 
planes  corresponding  to  multiple  landmark  pairs,  range 
intervals  and  angular  errors,  to  find  the  set  of  grid 
points  that  corresponds  to  the  viewframe-localization 
region.  Computationally,  we  have  saved  the  additional 
error  we  obtain  by  computing  multiple  local  coordinate 
transforms  between  landmark  pairs  and  intersecting  the 
results  algebraically.  This  is  possible  because  mapping 
the  viewframes  into  the  metric  data  gives  us  an  abso¬ 
lute  coordinate  system  to  work  in. 
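
The bit-plane intersection can be illustrated on a small grid. In this sketch each bit plane is a list of rows of booleans, one plane per landmark pair, marking the grid cells from which that pair subtends the observed angle to within the angular error. The grid size, landmark cells, sensor cell, and function names are made-up illustrations; range intervals could be added as further planes in the same way.

```python
import math


def subtended_angle(p, a, b):
    """Planar angle at p between the directions to a and b."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.acos(max(-1.0, min(1.0, cos_t)))


def pair_plane(width, height, a, b, theta, err):
    """Bit plane of cells (x, y) from which a and b subtend theta +/- err."""
    plane = []
    for y in range(height):
        row = []
        for x in range(width):
            if (x, y) in (a, b):
                row.append(False)   # the angle is undefined at a landmark cell
            else:
                row.append(abs(subtended_angle((x, y), a, b) - theta) <= err)
        plane.append(row)
    return plane


def and_planes(planes):
    """Cell-wise AND of equally sized bit planes."""
    return [[all(cells) for cells in zip(*rows)] for rows in zip(*planes)]


def count_set(plane):
    return sum(cell for row in plane for cell in row)


width = height = 20
A, B, C = (2, 2), (17, 3), (9, 18)    # landmark cells
sensor = (10, 8)                      # simulated true sensor cell
err = math.radians(5.0)

planes = [
    pair_plane(width, height, p, q, subtended_angle(sensor, p, q), err)
    for p, q in ((A, B), (A, C), (B, C))
]
localization = and_planes(planes)     # indexed as localization[y][x]
```

The cells that remain set in `localization` approximate the viewframe-localization region on the grid, without any coordinate transforms between landmark pairs.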


Landmarks  such  as  horizon  extrema,  change  of 
groundcover,  and  manmade  objects  can  be  predicted  a 
priori  in  the  grid  data.  We  can  use  the  metric  eleva¬ 
tion  data  to  compute  the  visibility  of  landmarks  from 
selected  points  in  the  viewframe  overlaid  on  the  grid 
data. See [Lawton et al. - 86] for an example of an
implementation of such a visibility and image
prediction. These predictions also yield range estimates that
depend  on  the  size  of  the  viewframe-localization. 

Finally,  mapping  viewframes  into  metric  data 
allows us to predict, given a heading, the LPB's that
we  should  cross  first.  We  again  represent  the  heading 
as  a  bit-plane  overlay  of  a  line  (or  other  one¬ 
dimensional  curve)  on  the  terrain  grid.  LPB’s  between 
landmarks  are  also  drawn  as  overlays.  Following  along 
the  heading  line  orders  the  boundary  crossings  obtained 
by AND-ing the overlays. Because the
viewframe-localization is, in general, a region rather than a point,
headings  can  be  drawn  from  multiple  points  in  the 
region  of  localization.  LPB  crossings  that  agree  in 
order  across  the  multiple  heading  possibilities  can  be 
used  as  predictors  to  flag  conflicts  as  potential  errors  in 
landmark  recognition  (i.e.,  bad  association  to  the 
metric  data),  or  as  inaccuracies  in  the  metric  data. 
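
The ordering of LPB crossings along a heading can be sketched with continuous geometry in place of bit-plane overlays; this substitution, and the all-or-nothing agreement check, are simplifications made for illustration. Each LPB is given by two landmark positions, the heading is a ray from a start point, and crossings are sorted by distance along the ray. The function names and coordinates are hypothetical.

```python
def crossing_order(start, direction, lpbs, tol=1e-12):
    """Names of the LPB lines crossed by a heading ray, nearest first.

    `lpbs` maps a name to two points defining the (infinite) LPB line.
    """
    sx, sy = start
    dx, dy = direction
    hits = []
    for name, ((x1, y1), (x2, y2)) in lpbs.items():
        ex, ey = x2 - x1, y2 - y1
        det = ex * dy - dx * ey
        if abs(det) < tol:
            continue   # heading is parallel to this LPB
        # Solve start + t*direction = (x1, y1) + u*(ex, ey) for t.
        t = (ex * (y1 - sy) - ey * (x1 - sx)) / det
        if t > 0:
            hits.append((t, name))
    return [name for _, name in sorted(hits)]


def agreed_order(starts, direction, lpbs):
    """Crossing order if every start point yields the same one, else None."""
    orders = [crossing_order(s, direction, lpbs) for s in starts]
    return orders[0] if all(o == orders[0] for o in orders) else None


lpbs = {
    "A": ((5.0, -1.0), (5.0, 1.0)),     # the line x = 5
    "B": ((2.0, -1.0), (2.0, 1.0)),     # the line x = 2
    "C": ((-3.0, -1.0), (-3.0, 1.0)),   # the line x = -3, behind the heading
}
heading = (1.0, 0.0)
stable = agreed_order([(0.0, 0.0), (0.0, 1.0), (1.0, -1.0)], heading, lpbs)

lpbs_with_diagonal = dict(lpbs, D=((3.0, 0.0), (6.0, 3.0)))   # the line y = x - 3
unstable = agreed_order([(0.0, 0.0), (0.0, 3.0)], heading, lpbs_with_diagonal)
```

When the start points sampled from the localization region disagree, as in the second case, the crossing order is not a reliable predictor and should not be used to flag a recognition or map error.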

The  three  spatial  representation  levels  of  metric 
data,  viewframes  and  orientation  regions  are  shown  in 
Figure  2-5.  In  Section  4,  we  discuss  inference  strategies 
that  use  information  at  all  three  levels  to  perform  navi¬ 
gation  and  guidance. 

3.  REQUIREMENTS  FOR  PERCEPTION 

Ideal  landmarks  for  navigation  and  map-building 
are  uniquely  distinguishable  points  that  are  visible  from 
anywhere  with  precisely  determined  range.  The  real 
world,  of  course,  consists  of  objects  that  occlude  each 
other  and  look  very  different  under  varying  viewing 
conditions.  Significant  perceptual  and  cognitive  infer¬ 
ence  is  required  to  recognize  the  same  objects  from 
different places and environmental conditions. For
navigation  and  map  building  in  particular,  the  changes 
in  object  appearance  can  be  considerable  because  the 
location  refining  power  of  landmarks  depends  on  relat¬ 
ing  landmarks  from  highly  separated  points  of  observa¬ 
tion.  In  this  section,  we  present  some  of  the  require¬ 
ments  and  assumptions  we  use  concerning  perceptual 
processing  for  landmark  extraction  and  recognition. 

3.1  LANDMARKS 

Landmarks  are  distinctive  and  stable  objects  or 
perceptual  events.  One  distinguishing  dimension  for 
landmarks  is  the  extent  to  which  they  are  perceptual 
events  or  instances  of  known  types  of  objects.  Percep¬ 
tual  processing  can  extract  landmarks  using  distinctive 
image  events  that  are  unique  in  multiple  fields  of  view 
and are stable during observer motion but are not
necessarily related to any type of modeled object.
Landmarks that are an instance of an object model
have default model-based information associated with
them that can simplify and direct the accumulation of
information as views of the landmark change,
especially due to model-based scale constraints. If a
perceptually based, but not model-based, landmark is not stable
from different views, it will be represented multiple
times  as  different  landmarks.  Another  perceptual  pro¬ 
cessing  requirement  for  landmarks  in  the  topological 
representation  is  to  be  able  to  extract  and  monitor 
specific  qualitative  singularities,  such  as  co-linearity  of 
landmarks,  or  an  LPB  crossing. 

A  further  classification  for  landmarks  is  whether 
they  are  of  type  point,  composite,  or  extended  (see  Fig¬ 
ure  3-1).  Point  landmarks  are  highly  identifiable  and 
localizable.  Composite  landmarks  consist  of  tightly 
connected  point  landmarks  corresponding  to  sub-parts, 
features  or  details.  Composite  properties  are  important 
for  localization  with  respect  to  landmarks  with  orienta¬ 
tional  features  such  as  the  faces  of  a  distant  building. 
Extended  landmarks  correspond  to  rivers,  roads,  moun¬ 
tain  ranges,  and  distinguished  terrain-type  boundaries. 
Knowledge  of  them  is  important  for  path  planning 
because  of  the  global  effects  they  have  on  allowable 
paths. 

Perceptually,  a  localization  relative  to  extended 
landmarks  involves  adjacency,  sidedness,  and  average 
distance  estimates.  Extended  landmarks  can  be  stable 
from  image  to  image,  but  provide  no  basis  for  precise 
localization  unless  point  landmark  events  are  simul¬ 
taneously  observed.  It  is  perceptually  important  to 
note  the  relations  between  these  different  types  of  land¬ 
marks  over  time,  and  to  accumulate  this  information  as 
part  of  the  stored  landmark.  Thus,  the  set  of  point 
landmarks  associated  with  composite  and  extended 
landmarks  must  be  explicitly  represented. 


Figure  3-1:  Landmark  Types 


3.2  PERCEPTION  SYSTEM 

Our  perceptual  system  and  its  relation  to  land¬ 
mark  extraction  is  sketched  in  Figure  3-2.  It  consists  of 
an  organized  hierarchy  of  perceptual  events.  The  bot¬ 
tom  level  events  are  a  time-indexed  buffer  of  images 
and  image  registered  features;  the  next  level  consists  of 
several  types  of  basic  perceptual  objects,  such  as 
points,  curves  and  regions.  These  are  processed  into 
spatial-temporal perceptual groupings that reflect
stable perceptual structures in images over time. The
perceptual objects and groups refer to positions in a
sequence of images, but are independent of a particular
view. Associated with perceptual objects is a mechanism
for selecting those that are inherently interesting.
This controls the application of grouping processes to
limit the complexity of matching the preconditions of
grouping rules to objects and relations. It also serves
to extract significant perceptual structures to cue
matching to object models.


Figure 3-2: Perceptual Processing for Landmarks

The hypothesis space consists of instantiations of
object models and viewframes that describe the relative
positions of landmarks from a point of observation. A
viewframe abstracts from the current scene model the
details associated with landmarks. Viewframes are
connected to each other along a path with pointers to
landmark correspondence and changes between
viewframes. The hypothesis space includes lists of the
currently visible landmarks associated with the
interestingness mechanism; this mechanism is used to
note the changes in landmarks over time, such as
occlusion or disocclusion and the emergence or
disappearance of detail. The occurrence of these events
determines when new landmarks should be extracted or new
viewframes established.




3.3  SIMPLIFYING  ASSUMPTIONS 

There are several assumptions about the types of
sensors and their arrangement that simplify landmark
determination. If we assume the observer knows the
relation between gravity and positions in an image, we
can orient each pixel with respect to the direction of
gravity to determine the vertical orientation of distant
landmarks. We can also constrain the localization of
the global horizon line and the sky. We have found
that selecting the n most interesting structures above
and near the global horizon line that are distinct from
sky provides a useful set of landmarks. In addition, the
sensor's orientation to the immediate ground-plane can
be used to define a local horizon, thus determining the
position of more local landmarks.

It is helpful for the visual sensors to have as
complete a view of the world as possible. A spherical
imaging surface provides an optimal arrangement
[Cao et al. - 86]. It can be approximated by a complete
coverage of the field of view using multiple calibrated
cameras. This resolves many problems in piecing together
multiple local views to infer an omni-complete view of
the environment. The sensors should also be stabilized
relative to the environment so observer motions are
effectively a sequence of (potentially different)
translations. With a complete coverage of the field of
view, there is a lessened need to reorient the sensors.

Monitoring the displacement of image features
relative to a known axis of translation simplifies frame
to frame matching [Bolles and Baker - 85; Lawton - 83],
and provides basic information for determining how
distant a landmark is, for how long a time it is
visible, and when potential occlusions may occur. The
classes of image transformations are constrained in
these cases. Further, for the spatial representation we
are building, it is not necessary to construct a precise
depth map, but only to track image events in terms of
qualitative transformations such as
occlusion/disocclusion and detail
emergence/disappearance. For this, approximate,
object-based matching along the translational flowpaths
suffices. Depending on the sensitivity of this
computation, exact stabilization may not be required,
and inertial estimates can drift over time.


4.  INFERENCE  FOR  NAVIGATION 
AND  GUIDANCE 


The  first  subsection  defines  different  types  of 
headings  and  the  associated  termination  conditions. 
Section  4.2  uses  the  spatial  representations  of  Section  2 
and  the  heading  structures  to  create  algorithms  that 
can  guide  a  robot  through  the  world  based  on  visual 
information.  The  last  subsection  details  rules  that 
implement  perceptually-based  path  planning  and  fol¬ 
lowing,  assuming  there  are  no  errors  in  re-acquiring 
landmarks,  but  not  assuming  that  all  landmarks  are 
necessarily  re-acquirable. 


4.1  HEADING  TYPES 

A  heading  is  defined  to  consist  of: 

•  type 

•  sensor-centered  coordinate  system  vector 

•  destination-goal 

•  direction-function 

•  termination-criteria 
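
As a minimal sketch, the five components listed above could be held in
a single record. The Python names used here (Heading, HeadingType, and
the field names) and the sample coordinate values are illustrative
assumptions, not part of the system described in this paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Tuple


class HeadingType(Enum):
    # The three heading types named in the text.
    METRIC = "metric"
    VIEWFRAME = "viewframe"
    ORIENTATION = "orientation"


@dataclass
class Heading:
    type: HeadingType
    vector: Tuple[float, float]                 # sensor-centered direction vector
    destination_goal: Any                       # coordinates, viewframe, region, or landmark set
    direction_function: Callable[[Any], bool]   # true while the heading is maintained
    termination_criteria: Callable[[Any], bool] # true when the heading should end


# Hypothetical metric heading with made-up world coordinates.
h = Heading(
    type=HeadingType.METRIC,
    vector=(1.0, 0.0),
    destination_goal={"world_xy": (500000.0, 4100000.0)},
    direction_function=lambda data: True,
    termination_criteria=lambda data: False,
)
```

Keeping the direction-function and termination criteria as callables
lets the same record serve all three heading types.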


The  type  of  a  heading  specifies  the  coordinate 
system  that  the  direction  conditions  are  computed  in. 
A  metric-heading-type  corresponds  to  an  absolute  coor¬ 
dinate  system.  A  metric  heading  can  be  induced  by  a 
correspondence  of  sensor  position  to  a  priori  map  or 
grid  data,  from  an  inertial  sensor,  geo-satellite  location, 
dead-reckoning  from  a  known  initial  position,  etc.  A 
viewframe-heading-type  refers  to  headings  computed 
between  viewframes  that  share  common  visible  land¬ 
marks.  This  corresponds  to  reasoning  within  a  local 
landmark coordinate system. An orientation-heading-type
is a coordinate-free heading based on observed
relationships with LPB's. Note that crossing LPB's
corresponds to entering new orientation regions.

Destination-goals  are  descriptions  of  places  that 
the  heading  is  intended  to  point  the  robot  or  vision  sys¬ 
tem  platform  toward.  A  destination-goal  may  be 
specified  as  any  combination  of: 

• a set of absolute (e.g., UTM) world coordinates

•  a  viewframe-localization 

•  an  orientation-region 

•  a  set  of  (simultaneously  visible)  landmarks 

The types of destination-goals are listed above in order
of increasingly topological, i.e., more qualitative,
representation. Figure 2-6 shows the relationship
between these levels of representation. A
sensor-centered direction vector is a choice of
immediate heading for robot locomotion. Range can be
supplied if available.

Direction-functions accept runtime data at multiple
levels of representation (i.e., orientation, view-based,
and/or metric) and return true if the heading is being
maintained and false otherwise. Direction-functions are
essentially predicates, except that they may have
side-effects such as updating heading error parameters.
Direction-functions for metric headings compare the
desired heading vector against that returned from sensor
readings. Directions for viewframe headings are given as
relative angles between the heading vector in the
sensor-based coordinate system and the observed
landmarks recorded in viewframes. Finally, orientation
directions are conjunctions of headings relative to
LPB's. They specify a set of simultaneously true
conditions indicating passage to the right, left, or
between landmark pairs.
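
A direction-function for a metric heading might look like the
following sketch: a predicate that also records the heading error as a
side effect. The class name, the use of a single compass angle in
degrees, and the tolerance parameter are assumptions made for
illustration only.

```python
class MetricDirectionFunction:
    """Predicate: is the measured heading within tolerance of the desired one?"""

    def __init__(self, desired_deg, tolerance_deg):
        self.desired_deg = desired_deg
        self.tolerance_deg = tolerance_deg
        self.last_error_deg = 0.0  # heading error parameter, updated on each call

    def __call__(self, measured_deg):
        # Signed smallest angle from desired to measured, in [-180, 180).
        error = (measured_deg - self.desired_deg + 180.0) % 360.0 - 180.0
        self.last_error_deg = error  # side effect described in the text
        return abs(error) <= self.tolerance_deg


on_heading = MetricDirectionFunction(desired_deg=10.0, tolerance_deg=5.0)
```

Calling on_heading(12.0) reports that the heading is maintained, while
on_heading(350.0) does not, and last_error_deg then holds the signed
error that a controller could use for correction.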




Termination  criteria  are  runtime  computable  con¬ 
ditions  that  indicate  that  if  the  heading  continues  to  be 
maintained,  its  direction-function  can  no  longer  return 
true.  This  can  occur  because  the  heading  has  been 
fulfilled,  meaning  that  we  have  reached  the  destination 
goal  implicit  in  the  direction-function.  For  example, 
this  occurs  if  we  are  at  the  desired  absolute  world  coor¬ 
dinate  location  specified  in  a  metric  heading,  if  we 
recognize  the  set  of  landmarks  and  relative  angles 
corresponding  to  a  viewframe  heading  destination,  or  if 
we  cross  the  LPB's  given  in  the  direction-function  of  an 
orientation  heading.  A  heading  will  terminate  with 
failure  if  we  accumulate  too  much  error  in  an  absolute 
coordinate  system  tracking  scheme  or  if  we  lose  sight  of 
the  set  of  landmarks  required  to  maintain  a  view-based 
or orientation-region heading. Termination criteria also
include feedback conditions from modules outside the
vision system, such as a path-planning module that
reasons about obstacles, traversability of the ground
surface, strategic concealment, etc.

Heading  types,  destination-goals,  direction- 
functions  and  termination  criteria  are  summarized  in 
Table  4-1.  The  following  explains  how  to  compute 
direction-functions  and  termination  criteria  for  basic 
inference  at  the  viewframe  and  orientation  region  levels 
of  spatial  description. 


Viewframe headings can be computed between two
viewframes that share at least two landmarks in common
(two for 2D reasoning, three for 3D reasoning). If the
set of landmarks included in a viewframe destination is
visible and we are very close to our original point of
observation, then we can necessarily (up to
traversability of the intervening terrain) perform a
hill-climbing algorithm to bring the sensor to the point
of observation where the viewframe was previously
collected. This can be accomplished by dynamically
computing a path based on control-feedback from the
relative angles between the observed landmarks.

However, if we are outside the finest orientation region
that contains our viewframe destination-goal, or if the
goal is only to get to where a set of landmarks are all
visible, then a hill-climbing control-feedback approach
will fail because we will have to cross LPB's to reach
our destination-goal, and the relative angles between
landmarks will vary non-monotonically.

An alternative approach is to formulate the
viewframe localizations for each of the viewframes in
the common local coordinate system defined by the two
(or three) landmarks and the sensor. Note that the
two landmarks must not be co-linear with the sensor
(the three must not be co-planar with the sensor).
We then have two regions expressed in the same
coordinate system. Any affine linear transformation that
maps one viewframe approximately onto the other may
be taken as a heading. For example, we can translate
the centroid of the first viewframe to the centroid of
the second. This is the definition of viewframe heading
transformation we use in this paper. It defines a
heading as a vector pointing between the viewframes, and
supplies the vision system with an intuitive notion of
"head thataway". A viewframe heading is not generally
the same as a metric heading, because the points
in space that the sensor occupied when the viewframes
were collected may not be mapped onto each other by
the viewframe heading transformation.

Table  4-1:  Heading  Specifications 


Figure 4-1 illustrates the generic situation in
which we can compute a viewframe heading. One
viewframe contains the landmarks L1 to L6, and the
other viewframe contains landmarks L3 to L9. Range
estimates to L3, L4, L5, and L6 may be different in the
two viewframes. Using the LPB connecting L3 and L4,
we can assign a local orientation to the vector pointing
from L3 to L4. This vector, with the implied
orientation of the ground plane, defines a local 2D
coordinate system.

Suppose that we have computed the viewframe
localization for the first viewframe in terms of the
coordinate frame defined by L1 and L2, and the second
viewframe localization in terms of L2 and L3. Our
problem is to determine the linear transformation that
carries the L1-L2 coordinate system into the L2-L3
coordinate system. Because we have range intervals to
the landmarks, this is an ambiguous computation. If
the range from the sensor to the landmarks were fixed,
the  transformation  for  a  point  r  j,</>  is  given  by: 

r  j ,0  -*  Jr  Q 

If we systematically vary the locations of L1, L2
and L3 over their range estimate intervals, we obtain a
continuous family of possible local coordinate
transformations. However, because we intersect the
transformed regions, the worst localization we can
obtain is the best localization from any pair of
landmarks.

Termination  criteria  for  a  viewframe-heading 
include  that  the  destination  goal  is  reached,  that  we 
have  traveled  the  approximate  distance  to  the  goal 
predicted  in  the  affine  transformation,  or  that  hill 
climbing  fails. 




VIEWFRAME 1        VIEWFRAME 2


Figure  4-1:  Viewframe  Heading 


Orientation-headings are conjunctions of specifiers
for crossing LPB's. Recall that for the LPB formed
from L1 and L2, i.e., LPB(L1,L2), the two sides are
denoted [L1 L2] and [L2 L1] respectively. The
landmarks L1 and L2 divide LPB(L1,L2) into three
sections. If we cross LPB(L1,L2) from the [L1 L2] side,
we can cross to the left of L1, between the landmarks,
or to the right of L2. These crossings correspond to
the visually observable events of L1 occluding L2 (or
having identical azimuth angle in the sensor centered
coordinate system), L1 and L2 being separated by 180
degrees, or L2 occluding L1. We denote these three
possibilities by l[L1 L2], b[L1 L2] and r[L1 L2].
Thus, for example, l[L1 L2] and r[L2 L1] are
crossings of the same section of LPB(L1,L2), but in
opposite directions. We also add the orientation-heading
of moving toward a fixed landmark. We denote this
heading with the symbol "a", and point out that we can
place this heading in a uniform representation with the
others by the artifice of using a[Li Li] to mean "head
toward landmark Li". Note, of course, that Li cannot
be crossed, but rather it is reached. We can view LPB
attainment as reaching rather than crossing also, so
this is not incompatible.
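
The correspondence between visual events and crossing specifiers can
be illustrated with a small classifier. This is a sketch under stated
assumptions: azimuths are in degrees, the nearer landmark is taken to
be the occluding one (so range estimates are needed), and the
tolerance and function name are hypothetical. The letters l, b and r
stand for crossing left of L1, between the landmarks, and right of L2,
as seen from the [L1 L2] side.

```python
def crossing_specifier(az1_deg, az2_deg, range1, range2, tol_deg=1.0):
    """Classify how LPB(L1,L2) is being crossed from the [L1 L2] side.

    Returns 'l', 'b', 'r', or None when no crossing event is observed.
    """
    # Unsigned angular separation of the two landmarks, in [0, 180].
    sep = abs((az1_deg - az2_deg + 180.0) % 360.0 - 180.0)
    if sep <= tol_deg:
        # Identical azimuth: the nearer landmark occludes the farther one.
        return "l" if range1 < range2 else "r"
    if sep >= 180.0 - tol_deg:
        # Landmarks on opposite sides of the sensor: passing between them.
        return "b"
    return None
```

For example, crossing_specifier(30.0, 30.5, 100.0, 250.0) yields 'l'
because L1 is nearer and therefore occludes L2.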

An  orientation-destination-goal  is  a  conjunction 
of  LPB  crossing  specifiers,  with  no  more  than  one  cross¬ 
ing  specifier  per  LPB  in  the  conjunction.  The  basic  ter¬ 
mination  condition  corresponding  to  an  orientation- 
heading  is  that  all  crossings  have  occurred.  Termina¬ 
tion  can  also  occur  if  it  is  impossible  to  proceed 
without  re-crossing  an  already  crossed  LPB,  or  if  none 
of  the  LPB's  are  visible. 
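
These termination conditions can be sketched as a small status check.
The dictionary representation (one required specifier per LPB name,
which enforces the one-specifier-per-LPB constraint), the enum names,
and the choice to test visibility only for LPB's still to be crossed
are assumptions made for illustration; the re-crossing condition is
not modeled here.

```python
from enum import Enum


class Termination(Enum):
    CONTINUE = 0
    FULFILLED = 1
    FAILED = 2


def orientation_termination(goal_crossings, crossed, visible_lpbs):
    """goal_crossings: dict LPB name -> required crossing specifier.
    crossed: dict LPB name -> specifier of the crossing already made.
    visible_lpbs: set of LPB names currently observable."""
    pending = {lpb for lpb in goal_crossings if lpb not in crossed}
    if not pending:
        return Termination.FULFILLED   # all crossings have occurred
    if not (pending & visible_lpbs):
        return Termination.FAILED      # no remaining LPB is visible
    return Termination.CONTINUE


goal = {"LPB(L1,L2)": "b", "LPB(L3,L4)": "l"}
```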

A top level orientation-heading consists of a goal
to execute the crossings to get on the correct sides of
the LPB's corresponding to visible landmarks that are
also listed in the destination-goal. For example,
suppose L1, L2, L3 and L4 are visible, and the
destination-goal includes [L2 L1] AND [L4 L3]. Then
we must cross both LPB's. It is straightforward to
select a single heading that will accomplish all
crossings. For example, if the viewframe order is
L2, L4, L1, L3, then a heading of b[L4 L1] will
accomplish crossing both LPB(L1,L2) and LPB(L3,L4).
A production system can be used to perform this
reasoning about the order of landmarks in a viewframe to
compute a heading-direction.

The  usefulness  of  the  production  system  is  that  it 
allows  us  to  reduce  a  conjunction  of  LPB  crossings  to  a 
single  LPB  crossing  condition;  if  relevant,  the  produc¬ 
tion  system  deduces  that  the  specified  conjunction  of 
headings  is  impossible. 

The  production  system  theory  is  based  on  a 
binary  operation  between  pairs  of  orientation-headings. 
Because  the  operation  is  associative  and  commutative, 
the  production  system  can  be  executed  on  a  conjunc¬ 
tion  of  orientation-headings,  two  at  a  time,  in  any 
order.  Therefore,  we  can  front  the  production  system 
by  a  pre-processor  that  identifies  the  viewframe  relative 
azimuth  angle  order  of  the  set  of  landmarks  involved  in 
the orientation-heading conjunction. For a pair of
orientation-headings, there are six relative angle
states that the four landmarks in the pair can have.
These are: that the landmarks of the two pairs are
perceived in left to right order, the first two being
the two from the first orientation-heading followed in
azimuth angle by the two landmarks from the second
orientation-heading; or that the landmark pairs
"overlap", the first of the first pair being followed by
the first of the second pair, then by the second of the
first pair, and finally the second of the second pair;
or that the landmarks of one pair are nested between the
landmarks of the other pair. Each of these three
arrangements can occur with either pair leading, giving
six states. These six possibilities, and the
corresponding productions for combination of headings
for two landmark pairs [A B] and [C D], are listed in
Figure 4-2. Notice that many of the heading combinations
are impossible.
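
The pre-processor step, identifying which of the six viewframe orders
applies, can be sketched as a sort on azimuth. The function name and
the dictionary input are assumptions; the sketch also assumes the
pairs have been labeled so that A is seen to the left of B and C to
the left of D, and that azimuth increases from left to right.

```python
VALID_ORDERS = {"ABCD", "ACBD", "ACDB", "CABD", "CDAB", "CADB"}


def viewframe_order(azimuths):
    """azimuths: dict mapping 'A', 'B', 'C', 'D' to azimuth in degrees.

    Returns the left-to-right order of the four landmarks as a string.
    """
    order = "".join(sorted(azimuths, key=azimuths.get))
    if order not in VALID_ORDERS:
        # A must precede B and C must precede D; relabel the pair first.
        raise ValueError("pair labels are not in left-to-right order: " + order)
    return order


state = viewframe_order({"A": 10.0, "B": 40.0, "C": 20.0, "D": 60.0})
```

Here state is "ACBD", the overlapping case; the resulting string would
then select the column of productions to apply.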


[Table: each row pairs an (A,B) cross specifier with a (C,D) cross
specifier; the columns give the combined heading for each viewframe
order of landmarks: ABCD, ACBD, ACDB, CABD, CDAB, CADB.]

Figure 4-2: Orientation Heading Binary Productions




The  case  when  we  are  already  on  the  goal  side  of 
an  LPB  requires  more  involved  reasoning.  We  want  to 
cross  certain  LPB’s  without  crossing  the  ones  we  are 
already  on  the  correct  side  of.  Without  (implicit  or 
explicit)  range  information,  we  cannot  tell  which  LPB 
we  will  cross  first  (on  any  heading);  see  Figure  2-5. 
One  approach  is  to  pick  a  heading  corresponding  to  the 
conjunctive  conditions  in  the  orientation-destination- 
goal  that  we  want  to  change,  and  dynamically  estimate 
the  rate  of  change  of  the  relative  angles  between  land¬ 
marks  as  the  heading  is  maintained.  A  control- 
feedback  approach  can  be  used  to  estimate  the  order 
that  LPB's  will  be  crossed.  Specifically,  over  a  short 
period  of  time,  typically  less  than  a  minute,  we  can 
project,  based  on  the  rate  of  change  of  the  angle 
between  the  landmarks  in  an  LPB,  how  long  it  will  be 
before  that  angle  reaches  zero  or  180  degrees,  and 
adjust  our  heading  accordingly. 
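
The projection described above amounts to a linear extrapolation of
each LPB's landmark angle. The following sketch assumes angles in
degrees, rates in degrees per second, and a constant rate over the
short projection window; the function names are hypothetical.

```python
def time_to_crossing(angle_deg, rate_deg_per_s):
    """Projected seconds until the angle between the two landmarks of an
    LPB reaches 0 or 180 degrees, assuming a constant rate of change."""
    if rate_deg_per_s > 0:
        return (180.0 - angle_deg) / rate_deg_per_s  # opening toward 180
    if rate_deg_per_s < 0:
        return angle_deg / -rate_deg_per_s           # closing toward 0
    return float("inf")                              # angle is not changing


def crossing_order(observations):
    """observations: dict LPB name -> (angle_deg, rate_deg_per_s).
    Returns LPB names sorted by projected time of crossing."""
    return sorted(observations, key=lambda k: time_to_crossing(*observations[k]))


order = crossing_order({
    "LPB(L1,L2)": (150.0, 1.5),   # reaches 180 degrees in 20 s
    "LPB(L3,L4)": (30.0, -2.0),   # reaches 0 degrees in 15 s
})
```

If the first LPB in the projected order is one we are already on the
correct side of, the heading would be adjusted before the crossing
occurs.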


4.2  LANDMARK-BASED  ENVIRONMENTAL 
REPRESENTATION 

A  natural  environmental  representation  based  on 
viewframes  recorded  while  following  a  path  is  given  by 
two  lists,  one  list  of  the  ordered  sequence  of  viewframes 
collected  on  the  path,  and  another  of  the  set  of  land¬ 
marks  observed  on  the  path.  We  call  the  viewframe 
list  a  viewpath.  The  landmark  list  acts  as  an  index 
into  the  viewpath,  each  landmark  pointing  at  the 
observations  of  itself  in  the  viewframes.  For  efficiency, 
the  landmark  list  can  be  formed  as  a  database  that  can 
be  accessed  based  on  spatial  and/or  visual  proximity. 
Visual proximity can be observed, or computed from an
underlying elevation grid and a model of sensor and
vision system resolution.
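
As a minimal sketch of the two lists, a viewpath can be held as a list
of viewframes and the landmark list as an index from each landmark to
its observations. Representing a viewframe as a plain left-to-right
list of landmark names is an assumption made to keep the example
short.

```python
from collections import defaultdict


def build_landmark_index(viewpath):
    """viewpath: list of viewframes, each a list of landmark names in
    left-to-right order.

    Returns dict landmark -> list of (viewframe index, position) pairs,
    i.e. pointers to every observation of that landmark on the path."""
    index = defaultdict(list)
    for frame_no, frame in enumerate(viewpath):
        for position, name in enumerate(frame):
            index[name].append((frame_no, position))
    return dict(index)


viewpath = [["L1", "L2", "L3"], ["L2", "L3", "L4"]]
landmark_index = build_landmark_index(viewpath)
```

Looking up landmark_index["L2"] gives both viewframes in which L2 was
sighted, which is the access pattern needed when chaining back through
a viewpath.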

Recall  that  the  primary  component  of  an 
orientation-region-heading  is  a  conjunction  of  LPB  pas¬ 
sage  specifiers.  Therefore  we  use  an  environmental 
representation  for  orientation-region  reasoning  that  is  a 
list  of  oriented  LPB's  encountered  and  crossed  in  the 
course  of  following  a  path.  We  call  such  a  list  an 
orientation-path.  As  with  viewpaths,  there  is  an  associ¬ 
ated  landmark  list  that  indexes  into  the  orientation- 
path. 

A  dynamically  acquirable  environmental  represen¬ 
tation  that  merges  the  representations  for  viewpaths 
and  orientation-paths  consists  of  an  ordered  list 
interspersing  viewframes,  LPB  crossings,  and  appear¬ 
ance  and  occlusion  (or  loss  of  resolution)  of  landmarks, 
as  well  as  recording  the  headings  taken  in  the  course  of 
following  the  path  over  which  the  environmental  map 
is  being  built.  Thus,  we  can  integrate  the 
representations  required  for  viewframe  and  orientation 
region  based  reasoning  with  heading  and  landmark 
information to formulate an environmental representation
that supports hybrid strategies for navigation and
guidance. The representation is formed at runtime and
consists of multiple interlocking, sequential,
time-ordered lists of visual events that include those
necessary for the algorithms presented in Section 4.3.

A landmark list is separately maintained. It
encodes proximity information between landmarks, and
acts as an index into the dynamic environmental path
representations. The data structure of the
environmental representation is pictured in Figure 4-3.

The first occurrence of a landmark points at the
instantiated schema in the vision system database that
was used to gather evidence in the landmark recognition
process. After that, all recognized re-occurrences
of this landmark point back at this initial instance.
The same is true for the first occurrences and successful
re-recognition of LPB's and viewframes. This mechanism
allows multiple visual path representations, built at
different times, to be incrementally integrated together
as they are acquired by using a common landmark
indexing/pointer list.

;; Viewpaths and sighting events

(defstruct viewpath
  date-initiated
  time-initiated
  start-location
  initial-destination-goal
  event-sequence-vector)

;; Slots shared by every event recorded on a viewpath.
(defstruct viewpaths-event-id
  viewpaths-occurs-on
  number-of-event-on-viewpath
  time-of-observation
  original-event-instance
  first-sighting-since-last-disappearance
  previous-occurrence
  next-occurrence)

(defstruct (viewframe
            (:include viewpaths-event-id))
  ordered-landmark-sightings
  range-intervals
  relative-solid-angles
  relative-angular-errors
  relative-azimuth-angles
  ordered-orientation-reversals)

(defstruct (original-landmark-instance-event
            (:include viewpaths-event-id))
  landmark-name
  schema-instantiation)

(defstruct (landmark-sighting-event
            (:include viewpaths-event-id))
  landmark-name)

(defstruct (landmark-disappearance-event
            (:include viewpaths-event-id))
  landmark-name)

(defstruct (lpb-sighting-event
            (:include viewpaths-event-id))
  lpb-name)

(defstruct (lpb-disappearance-event
            (:include viewpaths-event-id))
  lpb-name)

(defstruct (lpb-crossing-event
            (:include viewpaths-event-id))
  lpb
  crossing-specifier)

Figure  4-3:  Environmental  Representation 
Data  Structure 




4.3  VISUAL  PATH  PLANNING 
AND  FOLLOWING 

The  top  level  loop  for  landmark-based  path  plan¬ 
ning  and  following  is  to: 

1)  determine  a  destination-goal 

2)  compute  and  select  a  current  heading 

3)  execute  the  heading  while  building  up  an 
environmental  representation 

The  destination-goals  are  typically  determined  recur¬ 
sively,  implementing  a  recursive  goal-decomposition 
approach  to  perceptual  path  planning. 

In this section we give two algorithms, presented
as sets of rules, for performing this top level loop for
the cases of viewframe-based path following and for
path constraints through orientation nets. Note that
algorithms for path planning and following over metric
representations already exist; see, for example,
[Linden et al. - 86]. A rule-based implementation of
these algorithms permits hybrid perceptual path planning
and following strategies that can respond
opportunistically to the available data and the
certainty of the vision system output.

Inference over viewframes performs path planning
and following over a visual memory, and therefore
assumes that viewpaths, i.e., the visual memory, have
already been collected. If they have not, paths must be
planned and followed based on metric data, and
viewpaths collected in the course of following those
paths.

The concept underlying the path planning/following
strategies encoded in these rules is to mix the
following approaches as knowledge is available or can
be inferred:

•  find  landmarks  in  common  between  viewframes 
between  point  of  origin  and  viewframe- 
destination  and  compute  vector  (i.e.,  direction 
and  approximate  range)  headings  between 
viewframes 

•  locate  and  get  on  the  correct  side  of  LPB's 
specified  in  an  orientation-destination,  or 

•  associate  visible  and  goal  landmarks  with  map 
data  and  compute  a  metric  heading  between 
current  location  and  goal 

In  the  following  we  present  several  rules  that  effect 
these  strategies  and  then  develop  some  rules  that  mix 
the  approaches  in  attempting  to  reason  opportunisti¬ 
cally.  Notice  that  each  of  these  strategies  provably 
reaches  its  goal,  up  to  the  perceptual  re-acquisition  of 
landmarks  and  the  traversability  of  intervening  terrain. 


if viewframe goal landmarks visible
--> compute viewframe-heading
    if at least one LPB has an incorrect orientation
       relative to our viewframe-destination-goal
    then follow heading for approximate distance
         estimated by the viewframe-heading
    else maintain heading by control-feedback
         path following on relative angles between
         landmarks
    build a new viewpath to destination goal, using
         the existing landmark list where possible

if viewframe goal landmarks not visible and a viewpath
   exists
--> make a viewframe of the currently visible region
    chain back through viewpaths until common
         landmarks are located
    chain forward through viewframes setting up
         intermediate destination-goals
    recursively execute viewframe headings to reach
         the destination goals corresponding to
         visible landmarks

if viewframe goal landmarks not visible and no
   viewpath exists
--> set goal to find a metric heading
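
The three viewframe rules are mutually exclusive on their conditions,
so rule selection can be sketched as a simple dispatch. The function
name and the returned labels are hypothetical, and the sketch assumes
that "goal landmarks visible" means every goal landmark is in the
visible set.

```python
def select_viewframe_rule(goal_landmarks, visible_landmarks, viewpath_exists):
    """Pick which of the three viewframe rules fires.

    goal_landmarks, visible_landmarks: sets of landmark names.
    viewpath_exists: whether a previously collected viewpath is available.
    """
    if goal_landmarks <= visible_landmarks:
        return "compute-viewframe-heading"
    if viewpath_exists:
        return "chain-through-viewpaths"
    return "find-metric-heading"


rule = select_viewframe_rule({"L3", "L4"}, {"L1", "L3", "L4"}, viewpath_exists=False)
```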


The  advantages  of  orientation  region  based  rea¬ 
soning  over  viewframe  reasoning  are  that  orientation- 
headings  require  less  computation,  and  LPB’s  can  be 
accurately  acquired  while  moving  at  high  speed  relative 
to  the  processing  power  of  the  vision  system.  The 
latter  is  because  we  do  not  have  to  include  the  relative 
angles  between  landmarks  in  LPB’s;  so  if  we  extract  a 
viewframe  from  buffered  images,  this  point  does  not 
apply. 

LPB's are visually observable one-dimensional
subspaces of the ground surface. Therefore, if we
record a viewframe each time we cross an LPB (that we
are tracking), we should be able to re-acquire that
viewframe by searching along the LPB. We can use
control-feedback on the relative angles between
landmarks to locate the viewframe again. Being
one-dimensional, this search is very efficient.

Inference over orientation-paths, analogous to that
over viewpaths, can be performed in a visual memory of
LPB's and landmarks. However, orientation-paths can
also be approximately inferred from viewpaths. If two
viewframes were sequentially acquired, then all boundary
orientation reversals between the two viewframes must
have occurred on the path between the viewframes.
Although we cannot order them without metric data,
we can still use this partially ordered sequence of LPB
crossings to reason over; in particular we can use the
production system for reduction of conjunctions of
headings to compute an orientation-heading between
viewframes that we have already collected.
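
Inferring the orientation reversals between two sequentially acquired
viewframes can be sketched as a comparison of landmark order. The
sketch assumes each viewframe is a list of landmark names in
left-to-right azimuth order and treats that order as linear rather
than circular; the function name is hypothetical.

```python
from itertools import combinations


def orientation_reversals(frame_a, frame_b):
    """Return the landmark pairs whose left-to-right order differs
    between two viewframes; each such pair is an LPB that was crossed
    somewhere on the path between them (in unknown order)."""
    pos_b = {name: i for i, name in enumerate(frame_b)}
    common = [name for name in frame_a if name in pos_b]
    reversed_pairs = []
    for p, q in combinations(common, 2):  # p precedes q in frame_a
        if pos_b[p] > pos_b[q]:
            reversed_pairs.append((p, q))
    return reversed_pairs


crossed = orientation_reversals(["L1", "L2", "L3", "L4"],
                                ["L2", "L1", "L3", "L5"])
```

Only landmarks common to both viewframes contribute, so the result is
a lower bound on the LPB's actually crossed.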




if orientation region goal landmarks visible
  --> compute orientation-heading
      control feedback on LPB, using match to
        viewframe along boundaries to locate
        missing, but predicted landmarks
      build a new orientation-path to goal, using
        existing landmark list

if orientation region goal landmarks not visible and
orientation-path exists
  --> chain back through orientation-paths until
        landmarks common with current viewframe
        are located
      chain forward through boundary crossings
        setting up orientation-destination-subgoals
        recursively
      execute orientation-headings to reach the
        destination goals corresponding to visible
        landmarks

if orientation region goal landmarks not visible and
no unexplored orientation-path exists
  --> set goal to find metric heading
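
The precedence among the three rules above can be sketched as a small dispatch function. This is only an illustration: the function name, its two boolean inputs and the returned action strings are hypothetical, and the sketch treats "orientation-path exists" and "unexplored orientation-path exists" as the same condition.

```python
def select_navigation_action(goal_landmarks_visible, orientation_path_exists):
    """Pick the rule that fires, in the order the rules are listed.

    Hypothetical action names stand in for the rule bodies.
    """
    if goal_landmarks_visible:
        # Rule 1: head directly for the goal using orientation information.
        return "compute-orientation-heading"
    if orientation_path_exists:
        # Rule 2: chain back and forward through stored orientation-paths.
        return "chain-through-orientation-paths"
    # Rule 3: nothing usable in visual memory, so fall back on metric data.
    return "find-metric-heading"
```

Metric reasoning is the last resort in this ordering: it is reached only when neither the goal landmarks nor a stored orientation-path is available.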


In the following we present detail on methods for
tracking back through the environmental representation
to find landmarks that are common to both the
current viewframe and the destination-goal landmark
set, and for computing a plan as an ordered list of
recursively executable sub-goals.

The  key  to  the  search  is  to  use  the  landmark 
database  as  a  random  access  into  the  criss-crossing 
one-dimensional  path  structures  gathered  on  previous 
system  runs.  If  goals  are  given  for  regions  in  which 
there are no environmental maps collected, goals are set
up to develop headings from inference in metric data;
however, the metric (e.g., map) data is not entered into
the environmental representations until it is actually
perceived. We present the search methods as a set of
rules that operate over the environmental representation
and set up goal structures and pointers into it.

if goal to chain current viewframe to viewframe-
destination-goal
  --> if can metrically localize destination-goal
      viewframe
        --> match viewframe against landmark-database
            access for that metric localization
            for each landmark in current-viewframe
              intersect pointers to LPB's and viewframes
              with set of LPB and viewframe pointers
              for all landmarks in destination set
            if any intersections are non-empty
              --> set destination sub-goal to reach the LPB
                  or viewframe in the intersection
            else chain from destination through
              environmental paths until paths are
              exhausted or a landmark, LPB or
              viewframe with current-viewframe
              landmarks in it is found and set up a
              goal to follow the path
      else match viewframe against entire
        landmark-database and repeat above with
        best viewframe matches




if goal to metrically localize a viewframe
  --> if viewframe has grid attachment
        --> use it
      else check if any landmarks in viewframe have
        grid attachments and average them
      else attempt to make grid attachment for
        viewframe by matching landmarks in map
        data with line-of-sight reasoning for
        elevation
      else chain back along paths from viewframe to
        last headings before viewframe was collected
        and see if they had metric headings and set
        sub-goals to reach them
      else return failure to metrically localize

if goal to chain back from metric-destination and no
viewpath or orientation-path chain can be found
  --> set metric-heading until range reached or
      obstacle encountered

if metric-heading fails due to obstacle and viewpaths
or orientation paths cannot be found from the failure
point
  --> backtrack to skirt obstacle in direction of
      heading

if goal to skirt obstacle with metric heading failure
  --> if can match obstacle in map
        --> plan passage on map and set up sequence
            of metric headings with visual and range
            termination conditions
      else set skirt-heading and set daemons to find
        passage beyond obstacle
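
The pointer intersection used by the first chaining rule above can be sketched as follows. The sketch assumes that the landmark database is an index from each landmark to the identifiers of the LPB's and viewframes in which it was recorded; the class, its methods and the identifier strings are hypothetical and are not taken from the implemented system.

```python
from collections import defaultdict


class LandmarkDatabase:
    """Index from landmark name to the structures that mention it."""

    def __init__(self):
        self._pointers = defaultdict(set)

    def record(self, landmark, structure_id):
        """Note that a landmark appears in an LPB or viewframe."""
        self._pointers[landmark].add(structure_id)

    def structures_for(self, landmarks):
        """Union of the LPB and viewframe pointers of the given landmarks."""
        found = set()
        for landmark in landmarks:
            found |= self._pointers.get(landmark, set())
        return found


def shared_structures(db, current_landmarks, destination_landmarks):
    """LPB's and viewframes reachable from both landmark sets.

    A non-empty result corresponds to the branch of the rule that sets
    a destination sub-goal; an empty result corresponds to the branch
    that chains back through stored paths.
    """
    return db.structures_for(current_landmarks) & db.structures_for(
        destination_landmarks
    )


db = LandmarkDatabase()
db.record("peak-1", "LPB:peak-1/building")
db.record("building", "LPB:peak-1/building")
db.record("building", "viewframe:7")
db.record("peak-2", "viewframe:7")
db.record("peak-2", "viewframe:9")

sub_goals = shared_structures(db, ["peak-1", "building"], ["peak-2"])
```

Because each lookup is a direct access by landmark name, the cost of the search depends on the number of landmarks in the two sets rather than on the total length of the stored paths.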


5.  RESULTS 


Several landmark acquisition scenarios have been
simulated, and viewframes, orientation regions,
viewframe localizations and viewpaths have been
extracted from them. Figure 5-1 shows the simulated
landmark data over which viewframe localizations have
been extracted. Figure 5-1(a) shows an image taken at
the Martin Marietta Autonomous Land Vehicle
development site. Figures 5-1(b)-(e) show various
displays of the 30 meter U.S. Army Engineer Topographic
Laboratories terrain grid data gathered over
the area in the image. Figures 5-1(b) and (e) show two
different perspectives on a wire-frame view of the
elevation data. The displays include paint for terrain
type overlay, and a building present on the site. Figure
5-1(c) is an orthographic representation of the same
grid data, while Figure 5-1(e) is a painted perspective
display. Here the coarse quantization of the road area
in the foreground is evident. The circles on Figure
5-1(c) indicate three manually selected landmark points
representing two peaks and the building. The x-ed circle
on the right is the location of the simulated sensor.

Figures  5-2(a)-(f)  show  perspective  and  ortho¬ 
graphic  views  of  the  viewframe  localizations  obtained 
relative to pairs of landmarks. Range intervals of 50% of
the true range were used. Because the landmarks are
approximately  50  pixels  from  the  sensor,  this 
corresponds  to  range  intervals  800  meters  long  for  land¬ 
marks  1600  meters  away.  Angular  errors  of  .1  radian 
were used. In a 45 degree field of view for an image
512 pixels wide, this is approximately a 65 pixel error.


[Screen display of the viewpath simulator: command menu, perspective and overhead windows, and LISP listener output listing the landmarks, azimuths, elevations and ranges for the current viewframe.]

Figure 5-3: Viewpath Simulator




Both  errors  are  far  greater  than  we  expect  in  practice. 
The  intersection  of  the  landmark  pair  localizations, 
resulting  in  the  viewframe  localization,  is  shown  in  Fig¬ 
ures  5-2(g)  and  (h). 
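
The size of the two simulated errors can be checked with a few lines of arithmetic. The sketch below assumes a linear mapping from viewing angle to image column (the camera model is not stated here) and takes the interval length to be a simple fraction of the true range; the function names are hypothetical.

```python
import math


def range_interval_length(true_range_m, fraction):
    """Length of a range interval given as a fraction of the true range."""
    return fraction * true_range_m


def angular_error_in_pixels(error_rad, fov_deg, image_width_px):
    """Convert an angular error to pixels.

    Assumption: pixels are spread linearly over the field of view.
    """
    return error_rad * image_width_px / math.radians(fov_deg)


# Values quoted in the text: 50% intervals at 1600 meters, and a
# 0.1 radian error in a 45 degree, 512 pixel wide image.
interval_m = range_interval_length(1600.0, 0.5)
pixel_error = angular_error_in_pixels(0.1, 45.0, 512)
```

Under these assumptions the range interval is 800 meters long and the angular error is roughly 65 pixels, about one eighth of the image width.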

Figure  5-3  shows  an  example  from  the  viewpath 
simulator.  A  360  degree  sensor  is  assumed.  Perspec¬ 
tive  windows  are  used  to  manually  select  landmarks. 
The  overhead  windows  are  used  to  pick  a  path  based 
on  output  from  the  navigation  and  guidance  inference 
rules. Visibility (upper left), viewframes (center), and
orientation regions (upper right) are calculated as the
path  is  executed.  The  LISP  listener  shows  the  output 
from  the  simulator  for  the  viewframe  pictured  above  it. 
Although  the  simulator  outputs  absolute  ranges  to 
landmarks,  they  are  altered  to  coarse  intervals  or  elim¬ 
inated  before  being  passed  to  the  inference  process. 

Figure  5-4  shows  the  results  of  executing  a  series 
of  orientation  headings  to  create  a  viewpath  through 
the  elevation  data.  At  each  point  it  is  assumed  that  all 
visible  landmarks  are  re-acquired  and  correctly  associ¬ 
ated  to  prior  occurrences. 


6.  SUMMARY  AND  FUTURE  WORK 


A  rigorous  theory  of  qualitative,  landmark-based 
navigation  and  guidance  for  a  mobile  robot  has  been 
developed.  It  is  based  upon  a  theory  of  representation 
of  spatial  relationships  between  visual  events  that 
smoothly  integrates  topological,  interval-based,  and 
metric  information.  The  rule-based  inference  processes 
opportunistically  plan  and  execute  routes  using  visual 
memory  and  whatever  data  is  currently  available  from 
visual  recognition,  range  estimates  and  a  priori  map  or 
other  metric  data. 

Key  on-going  development  tasks  include: 

•  integration  of  the  navigation  and  guidance 
capability  with  the  vision  system,  and 

•  addition  of  a  non-monotonic  reasoning  system 
that  accounts  for  imperfect  re-acquisition  of 
landmarks. 


Other tasks include reasoning with extended landmarks,
such as roads, rivers and ridges, and inference of
shape  and  additional  topological  relationships  such  as 
region  containment. 

ACKNOWLEDGEMENTS 

This  document  was  prepared  by  Advanced  Deci¬ 
sion  Systems  (ADS)  of  Mountain  View,  California, 
under  U.S.  Government  contract  number  DACA76-85- 
C-0005  for  the  U.S.  Army  Engineer  Topographic 
Laboratories  (ETL),  Fort  Belvoir,  Virginia,  and  the 
Defense  Advanced  Research  Projects  Agency  (DARPA), 
Arlington,  Virginia. 

The  authors  wish  to  thank  Angela  Erickson  for 
providing  administration,  coordination,  and  document 
preparation  support. 




Interpretation  of  Terrain  Using  Hierarchical  Symbolic  Grouping 

for  Multi-Spectral  Images 


Bir  Bhanu  and  Peter  Symosek 


Honeywell  Systems  &  Research  Center 
3660  Technology  Drive,  Minneapolis,  MN  55418 


ABSTRACT 

An  Autonomous  Land  Vehicle  (ALV)  must  be  able  to 
navigate  through  terrain  in  order  to  accomplish  its  mission 
of  surveillance,  search  and  rescue  and  munitions  deploy¬ 
ment.  To  reliably  label  each  region  seen  in  an  image,  the 
vision  algorithm  must  make  use  of  as  much  a  priori  informa¬ 
tion  as  possible  because  of  the  immense  variability  of  out¬ 
door  scenes.  The  a  priori  information  may  be  of  many 
types,  for  instance,  the  nominal  location  of  each  terrain  class 
in  the  image  as  a  function  of  the  imaging  system’s  orienta¬ 
tion.  A  knowledge-based  algorithm  for  terrain  interpretation 
for  multi-spectral  imagery  is  described.  This  is  a 
significantly  new  approach  to  this  task  because,  unlike  the 
classical  tree  classifier  approaches  used  in  most  of  the 
remote  sensing  literature,  only  knowledge-based  techniques 
are  used  to  classify  image  regions.  This  permits  accurate 
classification  of  the  regions,  despite  imprecise  a  priori  infor¬ 
mation  for  the  feature  values.  The  algorithm’s  design  is 
described  along  with  a  discussion  of  the  new  research  issues 
which  have  been  identified  from  analysis  of  the  algorithm’s 
performance  for  real  multi-spectral  images. 

1.  INTRODUCTION 

The  interpretation  of  terrain  is  of  critical  importance 
to  the  Autonomous  Land  Vehicle  (ALV),  if  it  is  to  survive 
and  carry  out  its  mission  in  a  totally  unstructured  natural 
environment.  In  order  to  perform  the  functions  of  obstacle 
avoidance,  surveillance  and  path  planning,  the  ALV  must 
first  interpret  the  imagery  obtained  from  its  sensors  into  vari¬ 
ous  regions  in  terms  of  their  land  cover,  trafficability  and 
slope.  The  challenges  presented  by  this  task  are  significantly 
greater  than  those  encountered  for  navigation  on  paved  high¬ 
ways because of the great variability of the terrain. Therefore,
statistical approaches to the image interpretation
problem for this case are not sufficiently robust without
adaptive feature detection and world knowledge.

Knowledge-based  techniques  for  region  labeling  can 
tolerate  large  errors  in  feature  data  and  still  produce  mean¬ 
ingful  results.  They  can  be  designed  in  stages  because  of 
their  modular  rule  database:  as  new  contexts  are  discovered 
for  classification  of  features,  the  system  is  reconfigured  by 
the  definition  or  modification  of  a  few  rules.  Knowledge- 
based  systems  are  very  efficient  for  the  task  of  performing 
retrieval  operations  on  large  symbolic  databases  on  the  basis 


of  relational  and  contextual  constraints  on  the  data.  It  is  this 
critical  capability  that  makes  knowledge-based  techniques  so 
useful  for  terrain  interpretation:  The  contextual  information 
that  is  applicable  to  all  natural  environments  from  a  general 
world  model  can  be  used  to  improve  the  algorithm’s  perfor¬ 
mance  and  reliability. 

A substantial amount of work has been done on the
problem of labeling regions in segmented images with
knowledge-based techniques. These systems have been used
for photointerpretation applications,1-3 autonomous weapon
delivery systems,4 and the labeling of features in arbitrary
urban scenes.5-7 These expert systems have several features
in common: a database of calculated image features that is
matched with predicates of production rules, which are
represented as logical statements of the form "If ... then ...",
and a control system that supervises rule activation.

The  system  developed  by  Nagao  and  Matsuyama2 
uses  a  knowledge  base  representing  relational,  contextual 
and  geometric  constraints  for  the  task  of  region  labeling  for 
multi-spectral  imagery  obtained  from  low-flying  aircraft. 
The region boundaries are detected by a variety of low-level
image  segmentation  algorithms  and  the  resultant  information 
is  archived  on  a  blackboard  shared  by  each  of  the  experts  of 
the  system.  Each  expert  is  optimized  for  locating  a  specific 
kind  of  object  or  region.  They  devised  an  approach  for  the 
reliable  classification  of  vegetational  regions  that  is  indepen¬ 
dent  of  the  time  of  year,  using  the  ratio  of  two  distinct  spec¬ 
tral  bands  to  discriminate  the  vegetation  regions  from  the 
non-vegetation  regions.  They  demonstrate  that  knowledge- 
based  techniques  permit  the  reliable  identification  of  houses 
and  roads  in  congested  urban  scenes  where  other  classical 
approaches  normally  fail.  The  approach  that  they  use  is 
essentially  hierarchical  because  the  segmentation  and 
classification  of  micro-level  regions  of  the  image,  such  as 
houses  or  cars,  is  dependent  on  the  preliminary  segmentation 
of  the  region  that  contains  them.  A  region  growing 
approach  is  used  to  identify  large  regions  of  the  image  with 
global  region  properties  such  as  homogeneity  of  multi- 
spectral  intensity  levels  or  elementary  texture  measures,  but 
no  texture  information  is  used  at  this  stage,  which  is  critical 
for  the  detection  of  regions  in  arbitrary  outdoor  scenes. 
Also,  they  do  not  use  any  map  information,  which  is  very 
useful  for  the  segmentation  of  natural  scenes. 
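
The band-ratio idea mentioned above can be illustrated with a minimal sketch. Which two bands are divided and where the threshold lies are assumptions made here for illustration (for example, a near-infrared band over a red band); they are not taken from the cited system, and the pixel values are invented.

```python
def vegetation_mask(band_a, band_b, threshold=1.5, eps=1e-6):
    """Mark pixels whose band_a / band_b ratio exceeds a threshold.

    band_a and band_b are equally sized 2-D lists of intensities.
    The threshold is a hypothetical value; eps avoids division by zero.
    """
    return [
        [(a / (b + eps)) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(band_a, band_b)
    ]


# Invented 2 x 2 example: the left column behaves like vegetation.
band_a = [[200.0, 50.0], [180.0, 60.0]]
band_b = [[50.0, 60.0], [60.0, 55.0]]
mask = vegetation_mask(band_a, band_b)

# Halving the overall brightness of both bands leaves the ratio,
# and therefore the mask, unchanged.
dimmed_a = [[0.5 * v for v in row] for row in band_a]
dimmed_b = [[0.5 * v for v in row] for row in band_b]
dimmed_mask = vegetation_mask(dimmed_a, dimmed_b)
```

A ratio cancels any gain common to both bands, which is one reason such a feature is less sensitive to overall illumination than either band alone.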

Ohta6  developed  a  hierarchical  region  labeling 




scheme  for  color  images  of  urban  scenes.  The  approach  is 
hierarchical  because  an  initial  plan  image  is  derived  and 
labeled  before  a  more  detailed,  data-directed  segmentation  is 
carried  out.  The  plan  image  is  defined  by  a  region-based 
color  image  segmentation  algorithm.  The  macro-level 
regions  of  the  plan  image  are:  sky,  tree,  building  and  road. 
These  region  categories  are  detected  using  top-down  contex¬ 
tual  and  spectral  constraints.  The  algorithm  is  very  reliable 
and  can  correctly  label  regions  in  urban  outdoor  scenes 
using  only  57  rules.  His  algorithm  employs  a  texture  meas¬ 
ure  for  region  classification,  but  the  measure  is  expressed  as 
a  three  valued  logical  variable:  a  region  is  either  not- 
textured,  textured  or  heavily-textured.  This  approach  is  use¬ 
ful  for  the  discrimination  of  tree  regions  from  sky,  buildings 
and  roads,  which  are  essentially  untextured,  but  the  approach 
is  not  sufficiently  robust  for  the  discrimination  of  the  tex¬ 
tures  found  in  terrain  images. 

Previous  work  on  the  problem  of  scene  labeling  for 
remote  sensing  applications  has  predominantly  been  res¬ 
tricted  to  classical  techniques.  Landgrebe8  and  Swain9  sur¬ 
veyed  the  state-of-the-art  for  this  application  and  identified 
four  generic  approaches  to  this  problem:  1)  Spectral 
Methods,  2)  Spectral/Temporal  Methods,  3)  Spectral/Spatial 
Methods  and  4)  Spectral/General  Scene  Context  Methods. 
Algorithms  from  category  4)  do  employ  auxiliary  informa¬ 
tion,  such  as  digital  map  data,  water  table  depth  maps  and 
ownership  boundaries,  but  no  work  has  been  described  in  the 
literature  on  the  value  of  knowledge-based  techniques  for  the 
classification  of  regions  in  remotely-sensed  imagery.  Region 
labeling  with  a  priori  spatial  constraints  is  accomplished 
with  relaxation  labeling  for  consistency,10  and  measures  of 
image texture. Image texture may be measured locally with
techniques such as the co-occurrence matrix11,12 or globally,
with techniques such as production system grammars for
image structure.13 The major obstacle to defining a
knowledge  base  for  the  interpretation  of  remotely  sensed 
imagery  is  that  it  is  exceedingly  difficult  to  identify  struc¬ 
tural  rules,  for  this  kind  of  imagery,  that  are  universally 
valid.9 
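
As an illustration of the local texture measurement mentioned above, the following sketch computes a gray-level co-occurrence matrix for one pixel displacement and a contrast statistic derived from it. The displacement, the number of gray levels and the small test image are choices made here for illustration; the cited references may define their variants differently (for example, with symmetric counting).

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count gray-level pairs at a fixed displacement.

    counts[g1][g2] is the number of pixels with level g1 whose
    neighbor at column offset dx and row offset dy has level g2.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def contrast(counts):
    """Mean squared gray-level difference over all counted pairs."""
    total = sum(sum(row) for row in counts)
    weighted = sum(
        (i - j) ** 2 * n
        for i, row in enumerate(counts)
        for j, n in enumerate(row)
    )
    return weighted / total


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = cooccurrence_matrix(image, levels=4)
texture_contrast = contrast(matrix)
```

Smooth regions concentrate the counts on the diagonal of the matrix and give a low contrast; busy textures spread the counts away from it.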

Goldberg,  et  al14  designed  an  expert  system  for  the 
detection  of  changes  in  the  forests  of  Newfoundland  from 
LANDSAT  imagery.  The  system  is  hierarchical,  with  multi¬ 
ple  tiers  of  experts  applying  domain  knowledge,  in  a  top- 
down  fashion,  to  the  problems  of  detecting  foresting,  forest 
fire  damage  and  disease  infestation.  The  experts  exchange 
information  for  control  and  hypothesis  generation  on  a 
blackboard  message  space.  Knowledge-based  techniques  are 
used  at  the  intermediate  level  of  the  hierarchy  for  the  tasks 
of  change  detection  and  the  interpretation  of  changes  in 
forestry  terms,  but  no  knowledge-based  systems  are  used  for 
region  classification.  The  results  produced  by  these 
knowledge  sources  are  extremely  useful. 

Other  researchers  have  evaluated  Markov  Random 
Field  (MRF)  representations  for  image  texture  for  the  task  of 
multi-spectral  image  segmentation.15  A  Markov  Random 
Field  is  a  compact  representation  of  the  inter-dependencies 
of  the  multi-spectral  intensities  of  pixels  of  a  neighborhood 
region.  These  models  are  very  effective  at  distinguishing 
between regions whose means are the same, but whose
textures are different.

Our  approach  is  modeled  after  the  urban  scene 
interpretation algorithms referred to previously.5-7 What
distinguishes our work from this previous work is the use of
context-dependent constraints in the knowledge base, such as
map data or feature cues from a Landmark
Recognition/Prediction algorithm.16 Constraints imposed on
the  region  labeling  process  with  a  priori  information  for  the 
geographic  region  where  the  vehicle  is  traveling  improve  the 
reliability  of  the  knowledge-based  system  significantly.  This 
approach  will  play  a  vital  role  in  the  development  of  a 
totally  autonomous  system,  where  the  vehicle  will  be  able  to 
navigate  without  external  guidance  through  an  unstructured 
natural  environment. 

Interpretation  algorithms  for  remote  sensing  applica¬ 
tions  often  encounter  problems  caused  by  insufficient  spatial 
resolution  of  the  sensor.  The  effect  of  insufficient  spatial 
resolution  is  that  the  spectral  response  observed  at  one  loca¬ 
tion  (generally  a  pixel)  is  a  mixture  of  the  spectral  responses 
of  several  individual  land  categories,  because  the  geographi¬ 
cal  area  subtended  by  a  pixel  contains  a  mixture  of  terrain.8 
This  problem  is  called  the  mixture-pixel  problem.  For  the 
ground-to-ground  ALV  scenario,  a  similar  problem  exists. 
For  this  scenario,  the  resolution  is,  in  fact,  too  high  for 
pixel-based  labeling.  In  general,  the  spectral  response  of  an 
arbitrary region of a multi-spectral image will not be
equivalent  to  the  spectral  signature  of  a  single  category  of 
terrain,  because  observed  regions  will  be  composed  of  multi¬ 
ple  vegetation  and  land  cover  categories.  Therefore,  the  sig¬ 
nature  of  regions  seen  in  multi-spectral  images  will  represent 
the  accumulated  responses  of  several  vegetational  and  geo¬ 
logical  features.  Because  of  the  high  resolution  of  the 
images,  region-based  calculations  are  required  to  defeat  the 
effects  of  sensor  noise.  However,  region-based  measure¬ 
ments  will  be  distorted  by  the  inhomogeneity  of  the  data. 
Constraints  derived  from  context-dependent  knowledge  can 
resolve  this  dilemma.  Expert  systems  are  extremely  efficient 
at  defining  the  target  set  of  several  symbolic  constraints 
applied  to  a  relational  database.  Thus,  knowledge-based  sys¬ 
tems  are  the  most  appropriate  algorithm  for  terrain  interpre¬ 
tation  in  terms  of  efficiency  and  generality. 
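
The effect of mixed land cover on a region's signature can be illustrated with a simple model. The sketch assumes that the observed response is an area-weighted linear combination of the signatures of the categories present; this linear model and the signature values are assumptions made here for illustration only.

```python
def mixed_response(signatures, fractions):
    """Area-weighted sum of per-category spectral signatures.

    signatures maps a category name to a list of band responses;
    fractions maps the same names to area fractions that sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("area fractions must sum to 1")
    n_bands = len(next(iter(signatures.values())))
    return [
        sum(fractions[name] * signatures[name][band] for name in fractions)
        for band in range(n_bands)
    ]


# Invented three-band signatures for two land cover categories.
signatures = {
    "grass": [40.0, 80.0, 30.0],
    "soil": [90.0, 100.0, 110.0],
}
observed = mixed_response(signatures, {"grass": 0.5, "soil": 0.5})
```

The mixed response matches neither pure signature, so a classifier that compares a region only against single-category signatures can mislabel it; this is the situation in which additional contextual constraints become useful.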

The knowledge-based methodology, which we have
labeled Hierarchical Symbolic Grouping for Multi-spectral
data (HSGM), is an innovation in this field for several reasons.
The approach is hierarchical and, therefore, is robust
despite  significant  discrepancies  between  prior  information  in 
the  knowledge  base  and  the  actual  image.  The  first  stage  of 
the  algorithm  is  the  segmentation  of  the  image  into  macro¬ 
level  regions  with  a  gradient-based  technique  that  is  optimal 
for  the  spectral  characteristics  of  the  terrain  categories  that 
the  algorithm  is  designed  to  detect.  The  macro-level  regions 
are:  sky,  forest,  field,  and  road.  The  algorithm  defines  only 
closed  region  boundaries,  and  these  closed  regions  are 
classified  to  one  of  the  four  categories  with  a  knowledge- 
based  approach  which  uses  relational,  locational  and  spectral 
constraints.  The  macro-level  regions  are  further  segmented 
with  the  gradient-based  approach.  For  this  stage  of  the 
hierarchy,  the  gradient-based  algorithm’s  parameters  are  the 




Figure 1. Block diagram of the HSGM algorithm.


optimal  parameters  for  the  individual  macro-level  region. 
Then,  in  the  next  stage,  the  resulting  subregions  are 
classified  with  a  knowledge  source  that  can  use  contextual 
constraints  for  the  scene’s  structure.  Because  the  knowledge 
base  for  each  macro-level  region  contains  information  only 
for  the  terrain  category  it  classifies,  the  system’s  complexity 
is  greatly  reduced. 

The  algorithm’s  design  will  be  described  in  the  next 
section.  The  principles  of  the  design  of  the  gradient-based 
region  segmentation  algorithm  and  the  approach  to 
knowledge-based  classification  of  regions  are  discussed. 
Experimental  results  obtained  for  real  ALV  Multi-Spectral 
Scanner  imagery  are  discussed  in  Section  III.  The  terrain 
interpretation  algorithm  is  still  undergoing  revisions  as  new 
procedures  for  fine-tuning  its  performance  become  apparent. 
The  current  status  of  the  algorithm  will  be  explained.  Sec¬ 
tion  IV  itemizes  the  important  features  of  this  algorithm  and 
the  contributions  made  to  the  state-of-the-art  of  natural  scene 
interpretation.  The  potential  for  further  development  of  this 
algorithm  for  detailed  terrain  analysis  for  all  categories  of 
natural  terrain  is  also  discussed. 

2.  CONCEPTUAL  APPROACH 

2.1  Problem  Statement 

The  terrain  interpretation  algorithm  is  designed  to 
label  regions  seen  in  a  single  frame  of  multi-spectral 
imagery  to  one  of  several  land  cover  categories,  possibly 
including  information  regarding  the  slope,  trafficability  and 
geological  characteristics  of  the  region.  Examples  of  labels 
which  may  be  assigned  to  regions  are  field  with  grass  and 
chokeberry  vegetation  or  forest  with  hemlock  vegetation 
located  on  class  C  terrain.  It  may  not  be  possible  to  label  a 
region to a single category, but the candidate set for the land
cover  should  be  specific  enough  for  a  navigation  system  to 
either  match  the  region  to  a  digital  map  database  or  to  plan 
a  course  across  the  terrain  without  hitting  any  obstacles. 

2.2  HSGM  System  Overview 

The  terrain  interpretation  algorithm  is  subdivided  into 
five  stages  which  are: 

1.  Generation  of  "Plan"  Image  -  The  segmentation  of  the 
multi-spectral  image  into  a  "plan"  image. 

2.  Knowledge-Based  Classification,  Macro-Level  -  The 
classification  of  the  regions  of  the  plan  image  with  a 
knowledge-based  scheme. 


3.  Macro-Level  Region  Segmentation  -  The  segmentation 
of  the  labeled  macro-level  regions  with  gradient-based 
techniques  which  are  specifically  designed  for  those 
regions. 

4.  Knowledge-Based  Classification,  Micro-Level  -  The 
labeling of subregions with contextual and spectral
knowledge.

5.  Conflict  Resolution  -  The  revision  of  "plan"  image 
region  boundaries  if  subregions  are  found  at  their  bord¬ 
ers  that  do  not  belong  in  that  macro-level  region. 

The  block  diagram  of  the  HSGM  algorithm  is  shown  in  Fig¬ 
ure  1. 
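
The data flow among the five stages listed above can be sketched as a skeleton in which each stage is supplied as a function. Every name in the sketch is hypothetical, and the stage functions are placeholders for the algorithms described in the following sections rather than implementations of them.

```python
def run_hsgm(image, segment_plan, classify_macro, segment_region,
             classify_micro, resolve_conflicts):
    """Chain the five stages; each argument after image is a callable."""
    # Stage 1: coarse "plan" image made of macro-level regions.
    plan_regions = segment_plan(image)
    # Stage 2: knowledge-based label for each macro-level region.
    labeled = [(region, classify_macro(region)) for region in plan_regions]
    results = []
    for region, label in labeled:
        # Stage 3: segmentation tuned to the macro-level label.
        subregions = segment_region(region, label)
        # Stage 4: label each subregion in the context of its parent.
        sub_labels = [(sub, classify_micro(sub, label)) for sub in subregions]
        results.append((region, label, sub_labels))
    # Stage 5: revise region boundaries where subregions do not fit.
    return resolve_conflicts(results)


# Trivial stand-in stages, used only to show the flow of data.
macro_labels = {"top": "sky", "bottom": "field"}
interpretation = run_hsgm(
    image=None,
    segment_plan=lambda img: ["top", "bottom"],
    classify_macro=lambda region: macro_labels[region],
    segment_region=lambda region, label: [region + "-a", region + "-b"],
    classify_micro=lambda sub, label: label + "/" + sub,
    resolve_conflicts=lambda results: results,
)
```

Passing the macro-level label into stages 3 and 4 reflects the point made above that each macro-level region is processed with parameters and knowledge specific to its terrain category.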

2.3  Algorithm  Details 

In  the  following  paragraphs,  the  algorithms  that  form 
each  of  the  five  stages  will  be  described  in  detail.  For  each 
of  the  algorithms,  the  features  that  distinguish  our  approach 
from  classical  techniques  will  be  pointed  out. 

2.3.1  Stage  1.  Generation  of  "Plan"  Image  -  As  stated  previ¬ 
ously,  the  algorithms  of  this  stage  define  a  "plan"  image 
which  is  the  basis  of  all  succeeding  region  classifications. 
The  "plan"  image  is  a  coarse  representation  of  the  image 
because  the  algorithms’  parameters  are  tuned  to  detect  only 
larger  features.  Since  the  area  of  these  regions  is  large,  they 
will  suffer  very  little  degradation  from  sensor  noise  or  from 
invalid  algorithm  parameters  and  can  be  extracted  reliably 
from  the  image.6  The  calculation  of  the  "plan"  image  is  also 
a  very  important  data  formatting  step,  transforming  the  raw 
image  data  into  a  symbolic  form  which  is  more  easily  used 
by  a  high-level,  knowledge-based  labeling  algorithm.  We 
have named this innovative approach Texture-Gradient
Edge  Linking/Relaxation.  A  block  diagram  of  the  approach 
is  shown  in  Figure  2.  Each  of  its  constituent  algorithms  will 
now  be  explained. 

The  Texture-Gradient  Edge  Linking/Relaxation 
approach  is  more  robust  than  traditional  gradient-based  tech¬ 
niques  for  the  following  reasons:  For  the  traditional 
gradient-based  approach,  a  gradient  image  is  calculated  with 
a  differential  window  operator.  The  resultant  gradient  image 
is  transformed  into  a  binary  image  with  an  appropriately 
chosen  threshold.  If  the  value  chosen  for  the  threshold  is 
too  low,  the  resultant  region  boundary  image  will  be 
degraded  by  noise  boundaries,  output  along  with  the  true 




Figure 2. Texture-Gradient Edge Linking/Relaxation algorithm. TBL is
the Texture Boundary Locator; K and N are parameters used in the
Texture Boundary Locator algorithm.


region  boundaries.  If  the  value  chosen  is  too  high,  the 
region  boundaries  will  be  degraded  by  many  gaps.  One  of 
the  algorithms  of  stage  1,  the  Multi-Spectral  Edge  Linking 
Relaxation  algorithm,  uses  multiples  thresholds  for  the  esti¬ 
mation  of  the  true  region  boundaries.  It  endeavors  to  fill  in 
the  gaps  of  the  highest  threshold  binary  image  by  using  the 

evidence  from  the  edge  points  that  are  output  for  each  of  the 
remaining  thresholds  in  an  optimal  fashion.  In  this  way,  the 
confidence  that  the  edges  of  the  output  binary  image  are  true 
region  boundaries  is  high. 

The  gradient  operators  employed  by  the  algorithm  do 
not  have  to  be  intensity  gradient  operators:  The  only  require¬ 
ment  made  on  the  gradient  image  is  that  it  should  have  a 
large  magnitude  where  major  discontinuities  occur  between 
regions  of  interest  in  the  image.  Because  the  edges  between 
most  regions  in  natural  scenes  are  gradual  transitions  from 
one  region’s  mean  intensity  to  the  neighboring  region’s 
mean  intensity,  we  chose  to  use  the  Texture  Boundary  Loca¬ 
tor  algorithm  to  define  the  gradient  images. 

This algorithm calculates the texture gradient of an
image, which is the local rate of change of a textural
attribute of the image. The textural attribute is derived from
the mean $\mu$ and standard deviation $\sigma$ of each N x N
window of the image, where N is obtained as a function of the
image features to be detected.
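
As a rough illustration of the window statistics just described, the following Python sketch computes the mean and standard deviation of a single N x N window. The function name `window_stats`, the list-of-rows image layout, and the use of the population standard deviation are assumptions made for the example. The text does not say how the two statistics are combined into the textural attribute, so the sketch stops at computing them.

```python
from math import sqrt


def window_stats(image, row, col, n):
    """Return (mean, std) of the n x n window centered at (row, col).

    `image` is a list of equal-length rows of numbers. `n` is assumed to be
    odd, and the window is assumed to lie entirely inside the image.
    """
    half = n // 2
    values = [
        image[r][c]
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
    ]
    mean = sum(values) / len(values)
    # Population variance: divide by the number of pixels in the window.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, sqrt(variance)
```

Calling `window_stats` at every pixel gives one mean map and one standard deviation map per channel. A practical implementation would probably use running sums rather than revisiting each window.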

The texture gradient is calculated as follows: A
2K+1 by 2K+1 window is centered at each pixel, where K is
a function of the size of the region of interest and/or range
information. The window geometry for an arbitrary image
plane location P is shown in Figure 3. The pixels at the
centers of the four sides and the corners of the 2K+1 by
2K+1 window are labeled sequentially beginning at the top
left corner as shown in the figure. The texture gradient is
obtained as:


$$\max_{0 \le i \le 3} \sqrt{(\mu_i - \mu_{i+4})^2 + (\sigma_i - \sigma_{i+4})^2} \qquad (1)$$
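
The sketch below shows one way the texture gradient at a pixel could be evaluated. It is an interpretation rather than the authors' implementation. It assumes the eight labeled points of the 2K+1 by 2K+1 window are numbered 0 to 7 clockwise from the top left corner, that each point is compared with the point opposite it across the window, and that the largest of the four differences is kept. The inputs `mu` and `sigma` are hypothetical precomputed maps of window mean and standard deviation, and the function names are illustrative.

```python
from math import hypot


def perimeter_points(row, col, k):
    """Eight labeled points of the (2k+1) x (2k+1) window centered at (row, col).

    Labels 0..7 run clockwise from the top left corner (an assumed ordering).
    """
    return [
        (row - k, col - k), (row - k, col), (row - k, col + k),
        (row, col + k),
        (row + k, col + k), (row + k, col), (row + k, col - k),
        (row, col - k),
    ]


def texture_gradient(mu, sigma, row, col, k):
    """Largest (mean, std) difference between opposite perimeter points."""
    points = perimeter_points(row, col, k)
    best = 0.0
    for i in range(4):
        (r1, c1), (r2, c2) = points[i], points[i + 4]
        # Euclidean distance in the (mean, standard deviation) plane.
        diff = hypot(mu[r1][c1] - mu[r2][c2], sigma[r1][c1] - sigma[r2][c2])
        best = max(best, diff)
    return best
```

Under these assumptions a homogeneous neighborhood yields a gradient of zero, while a boundary passing through the window yields a large value for at least one opposite pair.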

The texture gradient is calculated only for those
multi-spectral images which display sharp differences between
the means of regions representing the four macro-level classes:
sky, forest, field, and road. As shown in Figure 2,
multi-spectral channels 1, 3, 6, 8, and 10 are the best images
of the 12 for this purpose.

The criteria employed to select these channels will
now be explained. Multi-Spectral Scanner data is collected
in 12 discrete bands, which are listed in Table 1.¹⁷ The
spectral characteristics of each of the regions of interest can
be verified visually from the images themselves or from
spectral reflectance measurements, such as those distributed
by the U.S. Army Engineering Topographic Laboratories
(ETL).¹⁸ The ETL data are obtained from field measurements
at the Martin Marietta ALV Test Site with the ETL
spectroradiometer. They provide precise measurements of
the spectral behavior of several terrain categories in the form
of graphs of radiance and of reflectance with respect to a
Halon Reference Standard, as a function of wavelength.
These measurements may be used to extrapolate the average
spectral reflectivities of the regions of interest. These
estimates are then used to identify the Multi-Spectral Scanner
channels which will exhibit the greatest discontinuity
between regions for all possible pairings of the four classes.
As a result of this analysis, a best discriminator set for the
four classes was defined. It is presented in Table 2, which
defines the best Multi-Spectral Scanner channel or channels
that may be utilized for terrain region boundary detection.
The optimal channel for detection of the region boundary
between a specific terrain category along the row
dimension and a terrain category along the column
dimension is found at the intersection of the row and the column.

Figure 3. Texture gradient window geometry for Texture Boundary
Locator algorithm.

Channel    Spectral Band (microns)
   1           .44 - .49
   2           .49 - .54
   3           .54 - .58
   4           .58 - .62
   5           .62 - .66
   6           .66 - .70
   7           .70 - .74
   8           .77 - .86
   9           .97 - 1.06
  10           1.0 - 1.4
  11           1.5 - 1.8
  12           2.0 - 2.6

Table 1. Multi-Spectral Scanner Channels

The parameters of the Texture Boundary Locator
algorithm are set for each image according to the nominal
range of the terrain class boundaries to be detected in that
image. For instance, channel 8 was identified as the best
discriminator between the forest and field classes. Because
fields are usually located within 5 to 100 yards of the ALV
(as a function of the ALV’s current mission profile), the
forest-field region boundary will be wide. Therefore, the
parameters of the Texture Boundary Locator algorithm used
for this image are K = 9, N = 11. The values of the parameters
N and K used for the calculation of each "plan" image for the
experimental results presented in this paper are shown in
Figure 2.

The evidence for region boundaries obtained by
processing each of the five images with the Texture Boundary
Locator algorithm is combined to produce a single gradient
image. Because the locations of the image where each of
the five gradient images attains its maximum value are
essentially disjoint, the gradient images may be combined
additively. Therefore, each gradient image provides
maximum support for the existence of a true inter-class boundary
for the region boundaries it is designed to detect and
corroborative evidence for the remaining region boundaries.
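
The additive combination can be pictured with a short Python sketch. The function name `combine_gradients` and the list-of-rows image layout are assumptions for the example. Any normalization or weighting of the individual gradient images is not described in the text and is left out.

```python
def combine_gradients(gradient_images):
    """Pixel-wise sum of equally sized gradient images (lists of rows)."""
    rows = len(gradient_images[0])
    cols = len(gradient_images[0][0])
    return [
        [sum(image[r][c] for image in gradient_images) for c in range(cols)]
        for r in range(rows)
    ]
```

Because the maxima of the individual gradient images are said to be essentially disjoint, a plain sum keeps each strong response intact and lets weaker responses from the other channels reinforce it.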


           Sky    Forest    Field    Road
Sky         X       1         1        1
Forest      1       X         3        6
Field       1       3         X       8,10
Road        1       6        8,10      X

Table 2. Multi-Spectral Channels which are optimal for the detection of
region boundaries between four macro-level classes. The optimal
channel(s) for the detection of a boundary between a class of the row
dimension and a class of the column dimension is found as the table entry
at the intersection of the row and column.
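
For readers who want to use Table 2 programmatically, the sketch below encodes its entries as a symmetric lookup. The names `BEST_CHANNELS` and `best_channels` are illustrative, and the channel values are simply those read from the table.

```python
# Unordered class pair -> channel(s) listed in Table 2.
BEST_CHANNELS = {
    frozenset({"sky", "forest"}): (1,),
    frozenset({"sky", "field"}): (1,),
    frozenset({"sky", "road"}): (1,),
    frozenset({"forest", "field"}): (3,),
    frozenset({"forest", "road"}): (6,),
    frozenset({"field", "road"}): (8, 10),
}


def best_channels(class_a, class_b):
    """Return the channel(s) listed for the boundary between two classes."""
    if class_a == class_b:
        raise ValueError("a boundary needs two different classes")
    return BEST_CHANNELS[frozenset({class_a, class_b})]
```

Keying on a `frozenset` makes the lookup independent of argument order, which mirrors the symmetry of the table. The union of all entries is the channel set 1, 3, 6, 8 and 10 used in stage 1.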


The next step of stage 1, the Multi-Spectral Edge
Linking Relaxation algorithm, employs an edge confidence
image for joining together incomplete boundaries. For an
edge confidence image, pixel values range from $A_1$ to $A_N$,
where $A_N$ signifies the highest confidence that a pixel is an
edge element. The confidence image locations with the
intensity value $A_N$ are those locations of the gradient image
whose intensities were equal to or greater than the strictest
threshold value for that image. For Multi-Spectral Scanner
imagery, three thresholds at the upper 15, 25 and 35 percent
of the confidence image intensity levels were found to be
sufficient to define the true region boundaries. The use of a
greater number of thresholds did not improve the quality of
the detected boundaries significantly.
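
A minimal sketch of how a multi-level confidence image might be built from a gradient image follows. It assumes that "upper 15 percent" refers to the top 15 percent of the intensity range (not a pixel percentile), and it encodes confidence as the number of thresholds a pixel passes, so that 3 plays the role of the highest level and 0 means no edge evidence. Both choices, and the function name `confidence_image`, are assumptions.

```python
def confidence_image(gradient, fractions=(0.15, 0.25, 0.35)):
    """Map a gradient image (list of rows) to integer confidence levels.

    A pixel receives one level for every threshold it reaches. The strictest
    threshold sits `fractions[0]` of the intensity range below the maximum.
    """
    flat = [value for row in gradient for value in row]
    low, high = min(flat), max(flat)
    thresholds = [high - f * (high - low) for f in fractions]
    return [
        [sum(1 for t in thresholds if value >= t) for value in row]
        for row in gradient
    ]
```

For a gradient image spanning 0 to 100, the thresholds fall at 85, 75 and 65, so a pixel of value 80 receives level 2 and a pixel of value 90 receives level 3.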

The  three  steps  of  this  algorithm  are: 

1.  Label  the  maximum  intensity  pixels  of  the  confidence 
image  as  edge  elements. 

2.  Identify  incomplete  region  boundaries.  If  no  incomplete 
boundaries  are  found,  stop;  otherwise  go  to  Step  3. 

3.  Link the incomplete boundaries found in Step 2, based
on local edge evidence and the smoothness of the
boundary. Go to Step 2.

The criteria employed for selecting new edge locations
for Step 3, in order of precedence, are the following: (1)
the edge location of maximum edge evidence of the set of
neighbors of the endpoint; (2) the edge location of maximum
edge evidence of the set of neighbors of the endpoint that
causes the smallest change in the curvature of the incomplete
boundary. Because new edge pixels are appended to
incomplete boundaries until they become complete boundaries,
the region boundaries defined by this algorithm are
closed. After edge-thinning, statistics are calculated for each
region, and these statistics are archived in a database for use
by the knowledge-based region labeling algorithm.
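
The two selection criteria can be sketched as a single ranking of the endpoint's neighbors. The code below is a simplified illustration, not the authors' algorithm. It assumes 8-connectivity, measures "change in curvature" by the turn angle between the incoming and outgoing boundary directions, and skips pixels that are already edge elements. All names are hypothetical.

```python
from math import atan2, pi

NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]


def turn_angle(prev, end, candidate):
    """Absolute change of direction (radians) when stepping to `candidate`."""
    angle_in = atan2(end[0] - prev[0], end[1] - prev[1])
    angle_out = atan2(candidate[0] - end[0], candidate[1] - end[1])
    delta = abs(angle_out - angle_in)
    return min(delta, 2 * pi - delta)


def next_edge_pixel(confidence, edge_pixels, prev, end):
    """Pick the pixel that extends an incomplete boundary ending at `end`.

    Candidates are ranked by edge evidence first (higher is better) and by
    turn angle second (smaller is better). Returns None if `end` has no
    free neighbor inside the image.
    """
    rows, cols = len(confidence), len(confidence[0])
    candidates = []
    for d_row, d_col in NEIGHBOR_OFFSETS:
        r, c = end[0] + d_row, end[1] + d_col
        if 0 <= r < rows and 0 <= c < cols and (r, c) not in edge_pixels:
            candidates.append((r, c))
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: (-confidence[p[0]][p[1]], turn_angle(prev, end, p)),
    )
```

Repeating this step from each endpoint until the boundary meets another edge pixel would give the loop of Steps 2 and 3. The stopping test is omitted here.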

2.3.2  Stage 2. Knowledge-Based Classification, Macro-Level
- The rules of the knowledge-based region labeling
algorithm are designed to locate regions of the plan image whose
features are good matches to known features of five classes
of terrain. The features of the five classes are archived in
symbolic form in the knowledge base. The matching
criterion will be described in the next paragraph. The five
classes of terrain are:

$$\Theta = \{ \text{sky}, \text{forest}, \text{mountain}, \text{field}, \text{road} \} \qquad (2)$$

The  features  used  by  the  region  labeling  scheme  are: 

1.  Spectral  features,  the  mean  and  standard  deviation  of 
each  region. 

2.  Locational  features,  the  correlation  of  the  location  of 
the  region  on  the  image  with  the  expected  location  as  a 
function  of  the  imaging  system’s  orientation. 

3.  Relational  features,  a  set  of  valid  adjacency  rules  for 
neighboring  regions. 

The matching criterion employed is a pseudo-Bayesian
measure for the conditional probability that the
region observed is a member of a specific class, given the
features of the region. The labels of the five classes are $c_i$,
$i = 1, \ldots, 5$, for the classes sky, forest, mountain, field and
road, respectively. The features employed to classify regions to
one of the five classes are $f_{ij}$, $j = 1, \ldots, J_i$, for class $c_i$,
$i = 1, \ldots, 5$. If the a priori probabilities for each class of
$\Theta$, $P(c_i)$, $i = 1, \ldots, 5$, are known, then the conditional
likelihood that a region is an instance of class $c_i$ is:

$$P(c_i \mid f_{i1}, f_{i2}, \ldots, f_{iJ_i}) = \frac{P(c_i)\, p(f_{i1}, f_{i2}, \ldots, f_{iJ_i} \mid c_i)}{\sum_{k=1}^{5} P(c_k)\, p(f_{k1}, f_{k2}, \ldots, f_{kJ_k} \mid c_k)} \qquad (3)$$

where $p(f_{i1}, f_{i2}, \ldots, f_{iJ_i} \mid c_i)$ is the probability
density of the features $f_{ij}$, $j = 1, \ldots, J_i$, conditioned on
the fact that they are observations of class $c_i$.

Because of the immense variability of outdoor
imagery, the features $f_{ij}$, for each $i$, of a specific region are
asymptotically independent as the size of the region becomes
large. "Plan" image regions are large regions and, therefore,
their features can be viewed as being independent. The
previous expression may be written as:

$$P(c_i \mid f_{i1}, f_{i2}, \ldots, f_{iJ_i}) = \frac{P(c_i) \prod_{j=1}^{J_i} p(f_{ij} \mid c_i)}{\sum_{k=1}^{5} P(c_k) \prod_{j=1}^{J_k} p(f_{kj} \mid c_k)} \qquad (4)$$

The initial probabilities of the classes, $P(c_i)$,
$i = 1, \ldots, 5$, are all equal to 1 for our system; because they
are equal, they cancel in the ratio of Equation (4). Equation (4)
may be used to calculate the likelihood that a specific region is
actually an observation of a specific terrain class.
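
Equation (4) translates almost directly into code. The sketch below is illustrative only: the function name `class_posteriors` and the dictionary layout are assumptions, and the per-feature likelihood values are taken as given inputs.

```python
from math import prod


def class_posteriors(likelihoods, priors=None):
    """Evaluate Equation (4) for every class.

    `likelihoods` maps each class label to the list of values p(f_ij | c_i)
    observed for the region. `priors` maps labels to P(c_i); when omitted,
    every class receives the same prior, which then cancels in the ratio.
    """
    if priors is None:
        priors = {label: 1.0 for label in likelihoods}
    scores = {
        label: priors[label] * prod(values)
        for label, values in likelihoods.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no class is compatible with the observed features")
    return {label: score / total for label, score in scores.items()}
```

For instance, a region whose two feature likelihoods are 0.5 and 0.5 under one class and 0.5 and 1.5 under another receives normalized likelihoods of 0.25 and 0.75 for the two classes.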

The terms $p(f_{ij} \mid c_i)$, $j = 1, \ldots, J_i$, $i = 1, \ldots, 5$,
in Equation (4) are very difficult to estimate for an arbitrary
multi-spectral image. To obtain meaningful results with a
Bayesian approach, it is necessary to approximate the
density functions $p(f_{ij} \mid c_i)$, $j = 1, \ldots, J_i$,
$i = 1, \ldots, 5$, as uniform likelihood functions for an
appropriate range of feature values.⁶ These functions are often
approximated as trapezoidal densities. An example of this
technique is the density estimated for the 5x5 window means of
multi-spectral channel 3, for the class forest, which is shown
superimposed on the actual measurements of the feature in Figure 4.
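
A trapezoidal likelihood of the kind mentioned above can be written in a few lines. In this sketch the breakpoints `a`, `b`, `c` and `d` are hypothetical parameters (they are not values from the paper), and the function is left unnormalized with a plateau of 1. That is enough for comparing classes but means it is not a true probability density.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal likelihood with breakpoints a <= b <= c <= d.

    The value is 1 on the plateau [b, c], 0 outside (a, d), and linear on
    the two ramps in between.
    """
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)  # rising ramp between a and b
    return (d - x) / (d - c)      # falling ramp between c and d
```

With breakpoints 0, 2, 8 and 10, a feature value of 5 gets likelihood 1.0, a value of 1 gets 0.5, and a value of 11 gets 0.0.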

Each of the three categories of features for this stage
(spectral, locational and relational) can be parameterized as
conditional likelihoods. In a strict probabilistic sense, the
conditional probabilities of relational features should depend
on the assigned classes of both regions that effect the
relation. That is, the likelihood should be a function of the
form:

$$p(f_{ij} \mid c_i, c_k) \qquad (5)$$

where $c_i$ is the class of the region being classified,
and $c_k$ is the class of the neighbor region.

Thus, Bayes Rule, as expressed in Equation (4), does
not apply for this feature and an alternative representation
for the effect of the relational feature on the conditional
class likelihood for class $c_i$ is necessary. This technical
difficulty is avoided for the current implementation of the
HSGM algorithm by doing the following: The relational
rules of the knowledge base are only applicable to adjacent
pairs of regions for which one of the regions has been
classified with certainty to one of the five classes. When a
region is classified with certainty for our system, all spectral
and locational rules and at least one relational rule that
affect the classification of the region have been evaluated,
and the class label is that class that has the greatest
conditional likelihood as calculated with Equation (4).

Because  relational  features  have  to  be  evaluated  for 
every  region  in  the  image,  there  must  be  an  initial  starting 
point  where  there  is  at  least  one  region  classified  with  cer¬ 
tainty.  The  terrain  interpretation  algorithm  uses  a  heuristic 
to  solve  this  problem.  All  regions  that  qualify  as  a  sky 
region  with  spectral  and  locational  constraints  alone  are 
labeled  to  the  class  sky  with  certainty.  Then  the  remaining 
regions  are  labeled  with  the  full  set  of  rules.  When  every 
region  has  been  labeled,  the  algorithm  stops. 
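
The control flow described above (seed with sky, then label the rest) might be organized as a worklist over the region adjacency graph. The sketch below is one possible arrangement, not the paper's rule engine. The callbacks `is_sky` and `classify` are hypothetical stand-ins for the spectral, locational and relational rules, and the breadth-first visiting order is an assumption.

```python
from collections import deque


def label_regions(adjacency, is_sky, classify):
    """Label regions outward from the regions accepted as sky.

    `adjacency` maps a region id to the ids of its neighbors.
    `is_sky(region)` applies the spectral and locational sky test.
    `classify(region, neighbor_label)` returns a label for `region`, given
    the label of an adjacent region that is already classified.
    """
    labels = {region: "sky" for region in adjacency if is_sky(region)}
    queue = deque(labels)
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in labels:
                labels[neighbor] = classify(neighbor, labels[current])
                queue.append(neighbor)
    return labels
```

In this arrangement every relational rule is evaluated against a neighbor that already carries a label. Regions not connected to any sky region stay unlabeled and would need a separate seeding rule.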

2.3.3  Stage 3. Macro-Level Region Segmentation - The third
stage of the terrain interpretation algorithm is the
segmentation of each "plan" image region into subregions
with the Texture-Gradient Edge Linking scheme of stage 1.
The only differences between the implementation of this
scheme for this stage and the implementation for the first
stage are (1) the group of images employed and (2) the
coefficients that define the sensitivity of the Texture Boundary
Locator algorithm. The images used for subregion
boundary detection for each of the five terrain classes
are those that exhibit sharp discontinuities between
subregions of interest. The coefficients of the Texture Boundary
Locator algorithm are the values that correspond to the range
of the "plan" image region.

2.3.4  Stage  4.  Knowledge-Based  Classification,  Micro-Level 
-  The  fourth  stage  of  the  algorithm  is  knowledge-based 
subregion  labeling.  The  region  labeling  scheme  is  identical 
to  the  region  labeling  scheme  of  the  second  stage,  except 
that  limited  relational  features  are  used.  The  categories  of 
spectral features are the same as for stage 2, with one
enhanced feature: the ratio of subregion mean intensities for
pairs of multi-spectral images.

2.3.5  Stage  5.  Conflict  Resolution  -  The  last  stage  of  the  ter¬ 
rain  interpretation  algorithm  is  the  resolution  of  conflicts. 
The  purpose  of  this  step  is  to  transfer  subregions  from  one 
"plan"  image  region  to  a  neighboring  "plan"  image  region  if 
the  classification  of  the  subregion,  obtained  from  stage  4,  is 
in  conflict  with  the  parent  region’s  classification.  This  step 
is  necessary  because  "plan"  image  region  boundaries  are  fre¬ 
quently  offset  from  the  true  boundary  location  by  as  much 
as  10  pixels,  due  to  the  unsharp  qualities  of  edges  in  outdoor 
scenes.  This  means  that  subregions  that  lie  at  the  border 
between  macro-level  regions  may  be  assigned  to  the  wrong 
region.  This  step  detects  all  subregions  which  may  have 
been  misclassified  because  they  are  adjacent  to  an  incorrect 
"plan"  image  region  boundary,  and  merges  them  into  the 
neighboring  region  if  there  is  no  conflict  with  the  class  label 
of  the  neighbor.  If  the  subregion  cannot  be  classified  as  an 
element  of  either  "plan"  image  region,  it  is  labeled  as  an  ele¬ 
ment  of  the  class  unknown.  This  is  a  useful  technique  for 
detecting  regions  for  which  the  HSGM  algorithm  has  no 
a  priori  information.  Shadows  may  also  be  detected  and 
properly  classified  by  using  physical  world  constraints. 
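
The decision made for each border subregion can be summarized as a small function. This is a schematic reading of the paragraph above, with hypothetical names. Real conflict resolution would also have to update region membership and boundaries.

```python
def resolve_conflict(sub_label, parent_label, neighbor_label):
    """Decide what to do with a subregion on a "plan" region border.

    Returns an (action, label) pair:
      ("keep", parent_label)        the subregion agrees with its parent
      ("transfer", neighbor_label)  it agrees with the neighboring region
      ("keep", "unknown")           it agrees with neither region
    """
    if sub_label == parent_label:
        return ("keep", parent_label)
    if sub_label == neighbor_label:
        return ("transfer", neighbor_label)
    return ("keep", "unknown")
```

For example, a subregion labeled road that lies inside a field region but borders a road region would be transferred to the road region.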


3.  EXPERIMENTAL  RESULTS 

The terrain interpretation algorithm has been
implemented for stages 1-3. Work on implementing stages 4 and
5 is ongoing. "Plan" image region classification results are
good.

The HSGM algorithm is implemented on two
computers at the Image Research Laboratory at the Honeywell
Systems and Research Center. The feature detection stages,
stages 1 and 3, are implemented in the C programming
language on a VAX 11/750 computer under the Berkeley
UNIX 4.3 operating system. The knowledge-based region
labeling formalism, stages 2, 4 and 5, is implemented in a
pattern-based reasoning language package, written in
Common Lisp, on a Symbolics 3670. Data is transferred from
one computer to the other by means of a ChaosNet File
Transfer Protocol (FTP).

The performance of the first stage of the HSGM
algorithm was evaluated for several Multi-Spectral Scanner
images from the Collage 1 database,¹⁷ which was acquired at
the Martin Marietta ALV Test Area. A few of the
experimental results obtained are described in the following
examples.

Example 1. Figure 5 displays the 12 Multi-Spectral Scanner
images of the image MULTI12 ("Segment O") from the
Collage 1 data set. The 12 multi-spectral images are shown
sequentially from left to right across the page, where
the first row begins at the top left corner of the figure, the
next row is the one below it, etc. As a result of a scanner
electronics malfunction, no data is available for channel 9
for any multi-spectral image of Collage 1, which is why
channel 9 in Figure 5 (3rd row, 1st column) is missing.
Figure 6 is the luminance image (the Y image of the NTSC
television standard for color imagery transmission) of the
multi-spectral image MULTI12 over which the "plan" image
region boundaries are superimposed. The segmentation of
all terrain classes (sky, forest, field and road) is good. Note
that the foothills are accurately segmented and that the
occluding borders of hills in the foreground are detected as
well. For the purpose of visual verification of the results,
the luminance image without superimposed "plan" image
region boundaries is presented in Figure 7.

Example 2. Figure 8 is the luminance image of the
multi-spectral image MULTI36 ("Segment V") from the Collage 1
data set, over which the "plan" image region boundaries are
superimposed. All major region boundaries of the image
have been detected, except for the left and right forks of the
road. These region boundaries were not detected because
they are located at too great a distance from the ERIM
sensor. The range of distances at which road boundaries may
be detected is a function of the parameters of the Texture
Boundary Locator algorithm for multi-spectral images 8 and
10, which are the best channels for detecting road
boundaries (see Table 2). Because the edges of road boundaries
for these channels are not sharp, the coefficients of the
Texture Boundary Locator algorithm are set for wide boundaries
(K = 9, N = 11 for multi-spectral channel 8 and K = 11, N = 13
for multi-spectral channel 10). When road regions are
located at a great distance from the ERIM sensor, as the two
forks of the road are, both borders of the road lie inside the
texture gradient measurement window (see Figure 3).
Therefore, the texture in their vicinity will appear to be
homogeneous and the Texture Boundary Locator will not
detect their boundaries. However, the Scene Model/Image
Processing Requirements for vehicle guidance at a velocity
of 10 km/hour, defined by Martin Marietta,¹⁹ are 15 meters
for the ERIM Multi-Spectral Scanner. Because the distance
to the fork in the road is approximately 30 meters for
MULTI36, the Scene Model that the HSGM algorithm
defines is more than adequate for road following. With
further improvements, such as a knowledge base enhanced
with elevation map, land cover map or range sensor
information and the ability to incorporate temporal evidence²⁰ into
the region classification procedure, it may prove to be
satisfactory for cross-country navigation. For the purpose of
visual verification of the results, the luminance image
without superimposed "plan" image region boundaries is
presented in Figure 9.

Figure 6. "Plan" image region boundaries: MULTI12.

Figure 7. Luminance image: MULTI12.

Figure 8. "Plan" image region boundaries: MULTI36.

Figure 9. Luminance image: MULTI36.




4.  CONCLUSIONS 

A  robust  algorithm  for  the  detection  of  structural 
region  boundaries  for  terrain  images  was  described.  This  is 
a  novel  technique  for  an  application  domain  that  traditionally 
has  been  approached  with  statistical  methods.  The  problem 
of  region  classification  is  solved  with  knowledge-based  tech¬ 
niques  because  of  their  efficiency  for  the  task  of  processing 
symbolic  feature  data.  Regions  of  the  segmented  multi- 
spectral  image  are  labeled  with  an  evidential  reasoning 
approach  because  it  counteracts  the  effects  of  incomplete  or 
inaccurate  knowledge  in  the  knowledge  base.  Experimental 
results  for  macro-level  region  boundary  estimation  are 
described.  The  algorithm’s  current  state  of  development  is 
the  following:  the  first  three  of  five  stages  are  implemented 
and  stages  4  and  5  are  now  being  evaluated  and  optimized. 
The  initial  experimentation  with  the  algorithm  has  permitted 
us  to  acquire  an  appreciation  for  its  potential,  and  on  the 
basis  of  its  strong  performance  for  a  very  difficult  image 
segmentation  problem,  we  plan  to  continue  the  development 
of  this  algorithm. 

REFERENCES 


1.  D.  M.  McKeown,  Jr.,  “Knowledge  Based  Aerial  Photo 
Interpretation,”  Photogrammetria  39  pp.  91-123 
(1984). 

2.  M.  Nagao  and  T.  Matsuyama,  A  Structural  Analysis  of 
Complex  Aerial  Photographs,  Plenum  Press  (1980). 

3.  W.  A.  Perkins,  T.  J.  Laffey,  and  T.  A.  Nguyen,  “Rule- 
based  interpreting  of  aerial  photographs  using  the 
Lockheed  Expert  System,”  Optical  Engineering 
25(3)  pp.  356-362  (March  1986). 

4.  L. Sauer and J. Taskett, Cultural Feature and Syntax
Analysis for Automatic Acquisition, SPIE Conference
on Processing of Images and Data from Optical Sensors
(1981).

5.  M.  D.  Levine  and  A.  M.  Nazif,  “Low  Level  Image 
Segmentation:  An  Expert  System,”  IEEE  Transactions 
on  Pattern  Analysis  and  Machine  Intelligence  PAMI- 
6(5)  pp.  555-577  (September  1985). 

6.  Y.  Ohta,  Knowledge-based  Interpretation  of  Outdoor 
Natural  Scenes,  Pitman  Publishing,  Inc.  (1985). 

7.  S.  M.  Rubin,  “Natural  Scene  Recognition  Using  Locus 
Search,”  Computer  Graphics  and  Image  Processing 
13  pp.  298-333  (1980). 

8.  D. A. Landgrebe, “Analysis Technology for Land
Remote Sensing,” Proceedings of the IEEE 69(5) pp.
628-642 (May 1981).

9.  P.  H.  Swain,  “Advanced  Interpretation  Techniques  for 
Earth  Data  Information  Systems,”  Proceedings  of  the 
IEEE  73(6)  pp.  1031-1039  (June  1985). 

10.  J. A. Richards, D. A. Landgrebe, and P. H. Swain,
“Pixel Labelling by Supervised Probabilistic
Relaxation,” IEEE Transactions on Pattern Analysis and
Machine Intelligence PAMI-3 pp. 181-191 (1981).


11.  C.  A.  Harlow,  M.  H.  Trivedi,  R.  A.  Conners,  and  D. 
Phillips,  “Scene  Analysis  of  High  Resolution  Aerial 
Scenes,”  Optical  Engineering  25(3)  pp.  347-355 
(March  1986). 

12.  A.  Rosenfeld,  C-Y.  Wang,  and  A.  Wu,  “Multispectral 
Texture,”  IEEE  Transactions  on  Systems,  Man  and 
Cybernetics  SMC-12(1)  pp.  79-84  (January/February 
1982). 

13.  J. M. Brayer, P. H. Swain, and K. S. Fu, “Modeling
Earth Resources with Satellite Data,” in Syntactic
Pattern Recognition, Applications, ed. K. S. Fu,
Springer-Verlag (1977).

14.  M.  Goldberg,  D.  G.  Goodenough,  M.  Alvo,  and  G.  M. 
Karam,  “A  Hierarchical  Expert  System  for  Updating 
Forestry  Maps  with  Landsat  Data,”  Proceedings  of  the 
IEEE  73(6)  pp.  1054-1063  (June  1985). 

15.  C. A. Therrien, “An Estimation-Theoretic Approach to
Terrain Image Segmentation,” Computer Graphics and
Image Processing 22 pp. 313-326 (1983).

16.  H.  Nasr,  B.  Bhanu,  and  S.  Schaffer,  Guiding  the  Auto¬ 
nomous  Land  Vehicle  Using  Knowledge-Based  Land¬ 
mark  Recognition,  Proceedings  of  the  DARPA  Image 
Understanding  Workshop  (Feb.  1987).  (These  Proceed¬ 
ings) 

17.  J.  A.  Allison,  “COLLAGE:  A  Collection  of  Sensor 
Images  for  the  ALV  Test  Area,”  Martin  Marietta  Corp. 
(December  1985). 

18.  J.  N.  Rinker,  J.  P.  Henley,  and  M.  B.  Satterwhite, 
“Terrain  Data  Base  -  Air  Photo  Analysis,  Martin 
Marietta  ALV  Test  Site,”  Martin  Marietta  Report 
(January  1986). 

19.  J.  Lowrie,  “The  Autonomous  Land  Vehicle  Second 
Quarterly  Report,”  Martin  Marietta  Corp.  (September 
1985). 

20.  B.  Bhanu  and  W.  Burger,  DRIVE  -  Dynamics  Reason¬ 
ing  from  Integrated  Visual  Evidence,  Proceedings  of  the 
DARPA  Image  Understanding  Workshop  (Feb.  1987). 
(These  Proceedings) 




DESIGN  OF  A  PROTOTYPE  INTERACTIVE  CARTOGRAPHIC 
DISPLAY  AND  ANALYSIS  ENVIRONMENT 

Andrew J. Hanson, Alex P. Pentland, and Lynn H. Quam*


SRI  International  (Artificial  Intelligence  Center) 

333  Ravenswood  Avenue,  Menlo  Park,  California  94025 


Abstract 

We  discuss  design  issues  for  an  interactive  scene  modeling 
system  appropriate  for  cartographic  tasks.  Among  the  major 
goals  of  such  a  system  are  the  ability  to  enter  and  display  car¬ 
tographic  features  registered  to  geographic  coordinate  systems, 
to  handle  cartographic  data  base  operations,  and  to  support  au¬ 
tomated  and  semiautomated  feature  compilation.  In  addition, 
the  system  should  be  suitable  for  use  as  a  softcopy,  interactive 
cartographic  display  system  that  could  replace  hardcopy  maps 
for  an  end  user.  A  prototype  for  such  a  system  has  been  imple¬ 
mented  and  is  being  actively  developed.  We  present  some  of  the 
features  and  capabilities  of  the  existing  system,  and  illustrate 
its  use  in  several  scenarios. 

1  INTRODUCTION 

Manual  photointerpretation  of  images  is  a  difficult  and  time- 
consuming  step  in  the  compilation  of  cartographic  information. 
Organizations  that  are  responsible  for  map  generation  are  faced 
with  increasing  volumes  of  data  to  be  processed,  as  well  as 
with  demands  for  an  increasing  variety  of  cartographic  prod¬ 
ucts.  Manual  photointerpretation  techniques  do  not  seem  to  be 
able  to  meet  the  projected  requirements;  thus  it  is  hoped  that 
these  needs  can  eventually  be  met  by  using  digital  image  data 
and  automated  image-understanding  approaches. 

Unfortunately,  it  is  extremely  unlikely  that  complete  automa¬ 
tion  can  be  achieved  with  current  image-understanding  tech¬ 
niques.  Among  the  concepts  that  might  be  used  to  circumvent 
this  problem  in  the  near  term,  and  perhaps  even  in  the  long 
term,  are  the  following: 

•  Computer-Assisted  Enhancement  of  Human  Pro¬ 
ductivity.  We  can  exploit  computer-automated  tools  for 
enhancing  the  productivity  and  capabilities  of  the  individ¬ 
ual  human  photointerpreter.  It  is  reasonable  to  expect  that 
a  division  of  labor  can  be  defined  in  which  decisions  re¬ 
quiring  substantial  expertise  and  knowledge  of  the  context 
are  carried  out  by  the  human,  while  tedious  tasks  requir¬ 
ing  less  abstract  knowledge  can  be  sent  to  the  computer. 
The human in such a scenario plays the role of a
supervisor who directs the attention of the automated processes,
then evaluates and edits the results to conform with his or
her superior knowledge of the task. To accomplish the
desired productivity enhancement, a well-engineered human
interface is absolutely essential.

* This research was supported principally by the Defense Advanced
Research Projects Agency under Contract No. MDA903-86-C-0088; some
aspects of the research were also supported by National Science
Foundation Grant Nos. DCR-8313768 and IST-8511751.

•  Conversion  to  Softcopy  Cartographic  Viewing  Sys¬ 
tems  Customizable  by  the  End  User.  Hardcopy  car¬ 
tographic  products  are  typically  designed  for  particular 
task  domains.  Attempts  to  economize  by  making  carto¬ 
graphic  products  that  cover  many  different  needs  are  coun¬ 
terproductive  because  extracting  task-specific  information 
becomes  too  difficult  for  users.  By  providing  computer  data 
bases  with  flexible  mechanisms  for  selecting  the  informa¬ 
tion  to  be  displayed,  we  can  permit  the  user  to  create  a 
map  specifically  adapted  for  the  task  at  hand,  while  retain¬ 
ing  the  freedom  to  change  tasks  quickly  within  the  same 
environment.  Furthermore,  since  computer  methods  may 
be  used  to  construct  customized  data  displays,  to  update 
rapidly changing cartographic information, and even to
simulate natural scene views in real time, the overall
effectiveness and utility of softcopy viewing methods could in many
cases  surpass  those  of  hardcopy  cartographic  systems. 

Research  on  computer-based  system  designs  that  can  simulta¬ 
neously  enhance  the  capabilities  of  an  individual  human  analyst 
and  adapt  to  the  cartographic  requirements  of  an  end  user  is 
therefore  of  great  interest. 

In  this  paper  we  discuss  the  design  and  implementation  of  a 
prototype  system  with  the  desired  capabilities.  Major  portions 
of  the  system  have  been  implemented,  tested,  and  rewritten  sev¬ 
eral  times,  so  that  our  concepts  of  the  design  issues  have  been 
significantly  seasoned  by  practical  experience  with  the  proto¬ 
type.  Section  2  gives  an  overview  of  the  design  objectives  and 
the  way  they  have  been  met  in  the  current  system.  Section  3 
presents several scenarios showing how the system can be used;
these  examples  illustrate  some  techniques  that  are  unique  to  the 
interactive  computer-based  analysis  environment.  We  conclude 
with  a  discussion  of  our  plans  for  future  work. 

2  SYSTEM  DESIGN 

We  believe  that  the  following  specific  capabilities  are  among 
those  needed  in  an  interactive  cartographic  display  and  analysis 
environment  that  meets  the  needs  outlined  above: 

1.  Support  creation,  editing,  interactive  manipulation,  and  re¬ 
alistic  rendering  of  objects  making  up  arbitrarily  complex 
three-dimensional  (3-D)  scenes. 




2.  Allow  such  objects  to  be  merged  with  geographically  precise 
data  bases  of  images,  terrain  models,  and  maps. 

3.  Handle  multiple  overlapping  views  of  the  same  geographic 
area,  including  but  not  limited  to  stereographic  image  pairs, 
while  maintaining  the  correct  geographic  locations  of  all 
interactively-moving  objects  in  each  view. 

4.  Generate  synthetic  scene  views  using  object  data  bases  com¬ 
bined  with  terrain  models,  images,  and  feature  modeling 
knowledge. 

5.  Incorporate  data  bases  of  semantic  object  relationships  and 
domain  knowledge,  along  with  the  ability  to  answer  interac¬ 
tive  set-membership  queries  concerning  data  base  contents. 

6.  Support  automated,  semiautomated,  and  manual  carto¬ 
graphic  compilation  procedures  within  the  system. 

7.  Support  the  incorporation  of  knowledge-based  systems  that 
correctly  deal  with  geographic  and  temporal  relationships 
as  well  as  with  the  context  of  cartographic  compilation. 

The  major  characteristics  of  the  current  system  are  summa¬ 
rized  in  Figure  1.  In  broad  overview,  the  system  works  on  a 
body  of  information  sources  and  provides  the  user  an  elaborate 
set  of  interpretive  tools  with  which  to  manipulate  and  edit  car¬ 
tographic  information  displays. 

2.1  Information  Sources 

Among  the  types  of  information  that  should  be  handled  by 
the  system  are  the  following: 

•  Images.  Digitized  photographs  or  similar  sensor  data  from 
geographic  areas  relevant  to  modeling  and  feature  extrac¬ 
tion  tasks.  Associated  with  the  images  should  be  supple¬ 
mentary  data  such  as  geographic  coordinates,  camera  mod¬ 
els,  illumination  information,  exact  times  that  allow  recon¬ 
struction  of  sun  position  for  outdoor  scenes,  sensor  char¬ 
acteristics,  sensor  response  information,  and  atmospheric 
properties  at  the  time  the  images  were  generated. 

•  Digital  Terrain  Elevation  Data.  When  available,  ter¬ 
rain  elevation  data  should  be  registered  to  the  correspond¬ 
ing  images.  This  permits  the  construction  of  accurate  syn¬ 
thetic  images,  as  well  as  the  generation  of  cartographic 
products containing elevation information.

•  Compiled  Feature  Data.  Cartographic  feature  data 
bases  are  needed  for  the  generation  of  cartographic  prod¬ 
ucts  meeting  the  needs  of  tasks  for  which  an  image  alone 
is  insufficient.  Ideally,  the  system  should  support  the  inter¬ 
pretation  of  precompiled  feature  data  bases,  the  editing  of 
any  item  in  such  a  data  base,  and  the  interactive  entry  of 
new  data  base  features. 

•  Feature  Models.  In  order  to  support  the  entry  of  new 
features,  a  library  of  predefined  prototype  feature  models 
is  required.  Such  models  would  be  used  either  by  an  ana¬ 
lyst  generating  a  distributable  feature  data  base  product, 
or  by  an  end  user  updating  an  existing  data  base.  Feature 
models  typically  interact  with  utilities  that  allow  them  to 
be  instantiated,  moved  to  correct  geographic  positions,  and 
adjusted  with  respect  to  their  internal  parameters. 


2.2  Interpretive  Tools 

The  interactive  needs  of  the  user  are  met  by  several  subsys¬ 
tems  of  software  tools  oriented  toward  cartographic  compilation 
tasks.  The  interpretive  tools  currently  available  or  planned  in 
the  immediate  future  include: 

•  ImagCalc  (TM).  The  SRI  ImagCalc  system  provides  a 
comprehensive  interactive  environment  for  viewing,  exam¬ 
ining,  and  analyzing  digitized  imagery.  The  interactive  en¬ 
vironment  itself  is  implemented  using  a  complete  library 
of  programmer’s  tools  that  are  available  for  developing 
new  image  processing  application  programs  and  systems. 
ImagCalc  supports  the  use  of  multiple  concurrent  image- 
containing  graphics  windows,  each  of  which  is  associated 
with  a  stack  of  image  displays.  Full  facilities  for  generat¬ 
ing  plots  and  overlay  graphics  are  available.  ImagCalc  also 
supplies  mechanisms  for  accessing  extremely  large  images 
directly  from  disk  storage,  without  requiring  that  the  im¬ 
age  occupy  the  computer’s  possibly  limited  virtual  memory. 

•  SuperSketch  (TM).  The  SRI  SuperSketch  modeling  sys¬ 
tem  [see,  for  example,  Pentland,  1986]  contains  complete  fa¬ 
cilities  for  creating  general  shapes  based  on  superquadric  al¬ 
gebraic  forms  and  their  deformations.  Utilities  are  available 
for  generating  fractal  textures  and  other  realistic  coloring 
and  texturing  in  rendered  scenes.  Composite  objects  with 
negative  components  are  available  to  permit  the  carving  of 
complex  holes  in  a  shape.  The  methods  used  by  SuperS¬ 
ketch  are  particularly  useful  for  the  construction  of  natural 
and irregular objects. Object models generated by SuperSketch
can be passed directly to the cartographic modeling
system  and  built  into  cartographic  scenes. 

•  Cartographic  Modeling  Tools.  To  meet  some  of  the 
particular  needs  of  cartography,  the  system  contains  an 
extensive  cartographic  modeling  facility.  Typical  models 
include  3-D  cross-hairs,  3-D  labels,  arbitrary  planar-faced 
objects,  smoothly  shaded  objects  represented  by  triangu¬ 
lar  grids,  digital  terrain  models,  closed-curve  area  delin¬ 
eations,  linear  features,  and  ribbon-like  features.  Various 
types  of  buildings  and  cultural  structures  are  represented 
as  planar-faced  objects,  while  SuperSketch  models  are  rep¬ 
resented  within  this  context  as  a  type  of  smoothly  shaded 
object.  Among  the  facilities  available  are: 

-  Interactive  adjustment  of  3-D  model  parameters. 

-  Precise  manual  entry  of  the  cartographic  coordinates 
of  an  object. 

-  Arbitrary  adjustment  of  the  camera  model. 

-  On-demand  adjustment  of  object  altitude  to  match  the 
elevation  in  the  digital  terrain  model. 

-  Matching  of  object  altitude  to  the  disparity  in  a  stereo 
pair. 

-  Display  of  the  path  of  a  ray  of  sunlight  through  any 
object  vertex  terminating  at  the  terrain  model  surface. 

Many  different  viewpoints  of  a  particular  scene’s  object 
models  may  be  maintained  and  kept  in  view  on  different 
windows  of  the  screen.  Motion  of  an  object  can  be  displayed 
simultaneously  in  multiple  views  so  3-D  alignment  is  achiev¬ 
able  with  or  without  a  stereoscopic  display.  Among  the  dif¬ 
ferent  motion  modes  available  is  the  capability  of  moving 




an  object  along  the  camera  ray  of  a  particular  image,  so 
that in other views the object moves along the corresponding epipolar line of each
view.  Several  different  digitized  images  can  be  displayed 
with  the  same  geographically  positioned  cartographic  ob¬ 
jects;  this  type  of  interimage  alignment  is  almost  trivial  in 
this  system,  but  can  be  difficult  to  achieve  using  manual 
techniques. 

•  Synthetic  Scene  Generation.  When  a  digital  terrain 
elevation  model  is  available,  any  image  may  be  used  to  gen¬ 
erate  a  synthetic  view  of  the  scene  from  an  arbitrary  cam¬ 
era  position.  This  facility  computes  a  view  of  the  elevation 
model  with  a  resolution-dependent  resampling  of  the  image 
projected onto it [see, for example, Quam, 1985]. Modeled
objects  can  be  incorporated  into  such  scenes  either  as  wire¬ 
frame  overlays  or  as  simulated  grey-scale  renderings  utiliz¬ 
ing  the  lighting  model.  One  can  construct  simulated  mono- 
scopic  or  stereoscopic  movies  that  greatly  enhance  the  user’s 
perceptions  of  the  scene  by  chaining  together  a  sequence  of 
such  views  and  displaying  them  in  rapid  succession.  For 
certain  tasks,  such  a  simulated  movie  of  the  terrain  appear¬ 
ance  can  be  vastly  more  useful  than  a  conventional  map. 

•  Cartographic  Data  Base  Facilities.  Several  coordi¬ 
nated  efforts  are  now  underway  at  SRI  to  design  data  bases 
appropriate  for  cartography  and  navigation.  While  these 
particular  capabilities  have  not  yet  been  incorporated  di¬ 
rectly  into  the  system,  we  have  implemented  elementary 
data  base  operations  that  allow  the  selection  and  group¬ 
ing  of  cartographic  features.  As  the  data  base  design  work 
matures,  we  expect  to  add  the  new  capabilities  to  the  cur¬ 
rent  prototype.  We  will  then  be  able  to  handle  data  base 
construction  and  interactive  queries  interrogating  the  car¬ 
tographic  data  structures. 

•  Semiautomated  Application  Facilities.  The  capabili¬ 
ties  and  utilities  of  the  existing  system  provide  a  very  flexi¬ 
ble  framework  in  which  to  build  applications.  With  our  cur¬ 
rent  interactive  facilities,  it  is  natural  to  emphasize  appli¬ 
cation  programs  that  have  a  substantial  interactive  compo¬ 
nent.  We  envision  the  system  starting  out  with  the  manual 
interfaces  required  for  many  cartographic  compilation  tasks, 
then  progressively  acquiring  the  subsystems  needed  to  au¬ 
tomate  various  subtasks,  and  finally  being  able  to  achieve 
some  types  of  work  in  a  completely  automated  fashion. 

We  believe  that  providing  an  environment  in  which  to  explore 
the  evolution  from  manual  computer-based  cartography  to  auto¬ 
mated  cartography  is  one  of  the  most  important  features  of  the 
effort  described  here.  With  existing  technology,  certain  tasks, 
such  as  those  depending  on  extensive  background  knowledge, 
common  sense,  and  reasoning,  can  be  carried  out  by  humans 
much  more  effectively  than  by  any  computer  system;  similarly, 
for  many  tasks,  such  as  those  involving  extensive  computation 
and  repetition  of  well-understood  algorithms,  computer  automa¬ 
tion  is  much  more  efficient  than  a  human  could  be.  Finally,  there 
is  a  middle  ground,  where  a  complex  operation  can  be  carried 
out  most  efficiently  using  techniques  that  involve  close  cooper¬ 
ation  between  a  human  operator  and  an  automated  computer. 
The  challenge,  then,  is  to  discover  those  problem  domains  and 
techniques  that  allow  humans  and  computers  to  cooperate  and 
produce  results  that  exceed  the  capabilities  of  either  functioning 
alone. 


3  ILLUSTRATIVE  SCENARIOS 

In  this  section,  we  describe  several  actual  scenarios  that  can 
be  carried  out  in  the  current  implementation  of  this  system. 

3.1  Object  Construction  by  Shadow  Alignment 

For  our  first  task,  let  us  see  how  an  accurate  3-D  building 
model  would  be  constructed  by  exploiting  various  features  of 
the  system.  In  Figure  2a,  we  show  an  image  containing  tall 
buildings;  supplementary  data  for  this  image  include  the  eleva¬ 
tion  model,  solar  illumination  geometry,  and  a  camera  model. 
First,  we  create  a  default  rectangular  building  model  and  use  a 
utility  to  drop  one  corner  to  the  ground  level  as  indicated  by  the 
elevation  model.  Next,  we  interactively  adjust  the  length  and 
width  of  the  building  model  until  they  coincide  with  the  appear¬ 
ance  of  the  building  roof  in  the  image.  We  now  adjust  the  height 
of  the  building.  Using  another  utility  to  drop  a  sun  ray  from  the 
building  corner  to  the  ground  (as  estimated  from  the  elevation 
model),  we  see  the  display  shown  in  Figure  2b;  by  adjusting  the 
height  until  this  ray  coincides  with  the  building  shadow,  we  can 
obtain  a  relatively  accurate  3-D  model  of  the  building.  An  ex¬ 
perienced  user  can  accomplish  this  entire  procedure  in  5  or  10 
seconds. 
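
The geometry behind this height adjustment can be sketched in a few lines of Python. This is a minimal illustration only, assuming locally flat ground and a sun position given as azimuth and elevation angles; the function names, the coordinate convention, and the flat-ground simplification are assumptions made for the example, whereas the system described here drops the ray onto the elevation model.

```python
import math

def sun_ray_ground_hit(corner, sun_azimuth_deg, sun_elevation_deg, ground_z):
    """Follow a sun ray through a roof corner down to flat ground.

    corner: (x, y, z) with x east, y north, z up (assumed convention).
    sun_azimuth_deg: compass bearing of the sun, clockwise from north.
    sun_elevation_deg: angle of the sun above the horizon (must be > 0).
    """
    x, y, z = corner
    height = z - ground_z
    # Horizontal shadow length produced by a vertical drop of `height`.
    length = height / math.tan(math.radians(sun_elevation_deg))
    # The shadow falls on the side of the corner away from the sun.
    az = math.radians(sun_azimuth_deg)
    return (x - length * math.sin(az), y - length * math.cos(az), ground_z)

def height_from_shadow(shadow_length, sun_elevation_deg):
    """Invert the relation: building height implied by a shadow length."""
    return shadow_length * math.tan(math.radians(sun_elevation_deg))

# Sun due south at 45 degrees: a 30 m corner casts a 30 m shadow to the north.
tip = sun_ray_ground_hit((0.0, 0.0, 30.0), 180.0, 45.0, 0.0)
```

Adjusting the model height until the computed shadow tip lies on the observed shadow is the interactive step described above; `height_from_shadow` is the closed-form equivalent for the flat-ground case.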

A  different  approach  is  available  if  we  make  use  of  the  sys¬ 
tem’s  scene  simulation  capabilities.  Using  the  solar  illumination 
geometry,  we  can  construct  a  camera  model  that  corresponds  to 
an  orthographic  projection  of  the  scene  through  the  location  of 
the  sun  itself.  In  this  view,  motion  along  any  sun  ray  leaves  a 
point  in  the  synthetic  image  invariant.  Similarly,  all  shadows  of 
building  corners  must  appear  exactly  aligned  with  the  building 
corners  themselves,  to  within  the  accuracy  limits  of  the  eleva¬ 
tion  model.  Using  this  “solar”  camera  model,  we  construct  the 
synthetic  scene  view  shown  in  Figure  2c,  and  align  the  build¬ 
ing  corners  with  the  shadows  directly  in  the  synthetic  image. 
In  Figure  2d,  we  show  another  synthetic  view  containing  a  ren¬ 
dering  of  several  3-D  building  models  constructed  using  these 
techniques. 
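
The invariance property of the "solar" camera can be checked with a small sketch. The code below is an illustration under stated assumptions, not the system's camera model: it builds an orthographic projection along a given sun direction (the helper name `solar_camera` and the choice of image axes are hypothetical) and shows that a roof corner and the point where its sun ray meets the ground project to the same image location.

```python
import math

def solar_camera(sun_dir):
    """Return a function that projects (x, y, z) orthographically along sun_dir.

    sun_dir points from the scene toward the sun and need not be a unit vector.
    Assumes the sun is not exactly at the zenith.
    """
    n = math.sqrt(sum(c * c for c in sun_dir))
    w = tuple(c / n for c in sun_dir)
    h = math.hypot(w[0], w[1])
    if h == 0.0:
        raise ValueError("sun at zenith: horizontal image axis is undefined")
    # Image u axis: horizontal and perpendicular to the sun direction.
    u = (-w[1] / h, w[0] / h, 0.0)
    # Image v axis: v = w x u, perpendicular to both.
    v = (w[1] * u[2] - w[2] * u[1],
         w[2] * u[0] - w[0] * u[2],
         w[0] * u[1] - w[1] * u[0])

    def project(p):
        return (sum(a * b for a, b in zip(p, u)),
                sum(a * b for a, b in zip(p, v)))
    return project

project = solar_camera((1.0, 1.0, 1.0))
corner = (10.0, 20.0, 30.0)
# Slide 30 units down the sun ray to reach the ground plane z = 0.
shadow = (corner[0] - 30.0, corner[1] - 30.0, 0.0)
corner_img, shadow_img = project(corner), project(shadow)
```

Because both image axes are perpendicular to the sun direction, any displacement along a sun ray contributes nothing to the projected coordinates, which is why corners and their shadows coincide in this view.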

3.2  Stereographic  Road  Delineation 

Suppose  now  that  we  have  a  stereo  pair  of  images  or  a  distinct 
pair with different camera parameters, and that our task is to enter
a  curve  that  follows  a  road  in  the  image.  Starting  from  the  image 
pair  in  Figures  3a  and  3b,  we  begin  laying  down  the  points  of 
a  curve  following  the  road.  At  each  point,  we  can  check  the 
appearance of the curve in both images of the pair simultaneously.
In  particular,  it  is  very  effective  to  use  the  utility  that  fixes  a 
point  of  the  curve  to  lie  on  the  camera  ray  in  one  image  while 
moving it along an epipolar line in the stereo view. We construct and
check  each  node  of  our  curve  in  this  way,  and  then  display  the 
pair  using  a  3-D  spline  fit,  as  shown  in  Figures  3c  and  3d. 
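
The constraint exploited by this utility can be illustrated with a small sketch. The following code is a simplified example, assuming two ideal pinhole cameras with no rotation and a common focal length; the function names and camera positions are hypothetical. It moves a point along the camera ray of one image and confirms that its projections in the second image fall on a single straight line.

```python
def project(point, center, focal=1.0):
    """Pinhole projection for a camera at `center` looking along +z (no rotation)."""
    dx, dy, dz = (point[i] - center[i] for i in range(3))
    return (focal * dx / dz, focal * dy / dz)

def point_on_ray(center, image_pt, depth, focal=1.0):
    """3-D point at the given depth along the camera ray through image_pt."""
    u, v = image_pt
    return (center[0] + depth * u / focal,
            center[1] + depth * v / focal,
            center[2] + depth)

def collinear(a, b, c, tol=1e-9):
    """True if three 2-D points lie on one line (zero signed area)."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (b[1] - a[1]) * (c[0] - a[0])) < tol

left = (0.0, 0.0, 0.0)        # first camera center (assumed)
right = (1.0, 0.2, 0.1)       # second camera center (assumed)
fixed = (0.1, -0.05)          # curve node held fixed in the first image

# Slide the node along the first camera's ray by changing its depth.
nodes_3d = [point_on_ray(left, fixed, d) for d in (5.0, 8.0, 12.0)]
in_left = [project(p, left) for p in nodes_3d]     # stays at `fixed`
in_right = [project(p, right) for p in nodes_3d]   # moves along one line
```

The node therefore has a single remaining degree of freedom, its depth, which the user sets by sliding it along that line until it sits on the road in the second image.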

3.3  Hand  Segmentation 

Finally,  suppose  our  task  is  to  create  a  manual  segmentation 
of  a  particular  area.  We  begin  with  the  image  in  Figure  4a, 
and  create  a  curve.  By  interactively  inserting,  deleting,  and 
moving  nodes  of  the  curve,  we  construct  the  delineation  shown 
in  Figure  4b,  and  its  spline  fit  shown  in  Figure  4c.  This  is  a 
typical  situation  for  which,  once  a  human  operator  has  provided 
a  good  starting  point,  an  automated  process  can  be  invoked  to 




refine  the  accuracy  of  the  result.  Here,  for  example,  we  use  a 
gradient-ascent  utility  developed  by  Leclerc  and  Fua  [1987]  to 
find  the  best  local  boundary  with  strong  edges  that  is  near  the 
initial  curve;  the  result  of  this  automated  refinement  operation 
is  shown  in  Figure  4d. 
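
To make the idea of automated refinement concrete, the sketch below moves each node of a hand-entered curve uphill in edge strength until it reaches a local maximum. It is a deliberately simple hill-climbing illustration on a list-of-lists image, and it is not the guided gradient-ascent method of Leclerc and Fua; all function names are hypothetical.

```python
def gradient_magnitude(img, x, y):
    """Central-difference gradient magnitude at (x, y), clamped at the borders."""
    h, w = len(img), len(img[0])
    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
    return (gx * gx + gy * gy) ** 0.5

def refine_node(img, node, max_steps=10):
    """Hill-climb one curve node toward a local maximum of edge strength."""
    h, w = len(img), len(img[0])
    x, y = node
    for _ in range(max_steps):
        best = (gradient_magnitude(img, x, y), x, y)
        for dx in (-1, 0, 1):
            for dy in (0, -1, 1):          # prefer no vertical drift on ties
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    g = gradient_magnitude(img, nx, ny)
                    if g > best[0]:
                        best = (g, nx, ny)
        if (best[1], best[2]) == (x, y):
            break                           # no neighbor is stronger: stop
        x, y = best[1], best[2]
    return (x, y)

def refine_curve(img, nodes, max_steps=10):
    """Refine every node of a curve independently."""
    return [refine_node(img, n, max_steps) for n in nodes]

# A soft vertical edge: intensity ramps up from left to right in every row.
image = [[0, 0, 10, 40, 80, 95, 100, 100] for _ in range(8)]
refined = refine_curve(image, [(1, 4), (6, 4)])
```

Both starting nodes, one on each side of the ramp, converge on the column where the intensity changes fastest. A practical method would also keep the curve smooth and connected, which this node-by-node search ignores.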

4  PLANS  FOR  FUTURE  WORK 

The  current  implementation  of  this  system  has  many  capabil¬ 
ities  that  are  of  interest  for  manual  and  semiautomated  digital 
cartography.  However,  much  remains  to  be  done  to  explore  and 
evaluate  the  effectiveness  of  the  techniques  that  might  be  used. 
Among  the  specific  efforts  that  we  plan  to  undertake  in  the  near 
future  are  the  following: 

•  Provide  Symbolic  Access  to  Scene  Objects.  Any  car¬ 
tographic  object  can  be  viewed  either  as  a  geographical  en¬ 
tity,  with  a  display  mode  that  is  typically  registered  to  the 
image  in  which  it  appears,  or  as  a  symbolic  entity,  with 
relationships  to  other  features  in  the  cartographic  context. 
Objects  that  are  composites  or  that  are  components  of  com¬ 
posite  objects  are  sometimes  best  manipulated  in  terms  of 
their  logical  identity,  rather  than  their  geographic  one.  To 
meet  this  need,  we  plan  to  develop  a  dual-mode  access  sys¬ 
tem  that  allows  the  user  to  construct  symbolic  graphs  corre¬ 
sponding  to  geographic  feature  clusters,  to  display  symbolic 
graphs,  and  to  interact  with  such  graphs  in  a  way  that  dis¬ 
plays  the  geographic  meaning  of  the  graph  nodes  simulta¬ 
neously  with  the  semantic  meaning.  To  accomplish  this,  we 
will  probably  incorporate  other  SRI  work  on  symbolic  carto¬ 
graphic  data  bases  and  graphical  knowledge  representation 
methods  into  our  cartographic  interpretation  capabilities. 

•  Extend Scene Simulation Capabilities. The current
system  supports  only  relatively  simple  rendering  and  sur¬ 
face  characteristics.  One  of  our  next  goals  will  be  to  add 
the  ability  to  use  digital  image  information  from  multiple 
images  in  scene  simulations.  We  will  also  add  prestored 
texture  models  for  certain  types  of  objects  so  that  we  can 
simulate  additional  details  of  objects  with  internal  struc¬ 
ture.  Finally,  we  plan  to  add  the  capability  of  storing  the 
characteristics  of  features  that  are  significant  for  SAR  im¬ 
age  simulation,  so  that  we  can  ultimately  generate  synthetic 
SAR  images  as  well  as  optical  images.  We  also  intend  to 
investigate the problem of simulating scenes at times of day
different  from  those  of  any  available  source  image.  This  is 
technically  very  difficult  to  do  correctly  when  texture  map¬ 
ping  from  real  images,  but  some  reasonable  approximation 
techniques  might  be  found  that  would  be  very  helpful  for 
human  viewers. 

•  Support  Semiautomated  Feature  Extraction.  One  of 
the  long-term  goals  of  the  system  is  to  provide  support  for  a 
wide  range  of  manual,  semiautomated,  and  automated  car¬ 
tographic  compilation  procedures.  In  the  short  term,  our 
first  goal  will  be  to  merge  some  current  SRI  work  on  fea¬ 
ture  extraction  into  the  cartographic  modeling  system.  For 
example,  the  building  outlines  produced  by  the  automated 
feature  extraction  system  of  Fua  and  Hanson  [1986,  1987] 
can  be  used  as  initial  building-top  outlines  in  the  modeling 
system;  with  such  a  starting  point,  it  is  straightforward  to 


determine  the  3-D  extent  of  the  building  using  the  shadow- 
driven  techniques  described  above. 

•  Exploration  of  User  Interface  Design.  Among  its  other 
capabilities,  the  system  we  have  developed  is  ideal  for  test¬ 
ing  out  the  concept  that  computer  displays  can  eventually 
replace both manual photometric compilation and the use of
hardcopy  maps.  In  particular,  new  ideas  concerned  with  the 
use  of  digital  data  bases  for  the  end-user  assembly  and  soft- 
copy  viewing  of  cartographic  products  can  be  prototyped 
and  tested  in  this  environment.  Our  last  major  objective  is 
therefore  to  continue  to  explore  many  different  concepts  for 
the  user  interfaces  of  such  systems. 


REFERENCES

P.  Fua  and  A.J.  Hanson,  “Resegmentation  Using  Generic  Shape: 
Locating  General  Cultural  Objects,”  Pattern  Recognition 
Letters  (1986)  in  press;  “Using  Generic  Geometric  Mod¬ 
els  for  Intelligent  Shape  Extraction,”  in  this  Proceedings 
(1987). 

Y. Leclerc and P. Fua, “Finding Object Boundaries Using Guided
Gradient  Ascent,”  in  this  Proceedings  (1987). 

A.P.  Pentland,  “Perceptual  Organization  and  the  Representa¬ 
tion  of  Natural  Form,”  Artificial  Intelligence  28,  pp.  293- 
331  (1986). 

L.H.  Quam,  “The  Terrain-Calc  System,”  in  Proceedings  of  the 
Image  Understanding  Workshop  (1985). 




[Block diagram labels: SRI System Components; Information Sources; Interpretive Tools]


Figure  1:  Block  diagram  of  the  major  components  of  the  SRI  prototype  cartographic  analysis  and 
display  system. 




Figure 2: (a) An image containing tall buildings with distinct shadows. (b) The sun ray from the
building corner being adjusted to the ground. (c) The building model viewed against its
shadow in a synthetic image taken from the viewpoint of the sun itself. (d) Another synthetic
view of the building model, this time rendered into the scene using a simple Lambertian
reflectance model.




Figure 3: (a) Right stereoscopic view of terrain containing a road. (b) Left stereoscopic view of terrain
containing a road. (c) Right view of the stereographic curve constructed to follow the road.
(d) The corresponding left view of the curve.




Figure 4: (a) An image containing a region to be segmented manually. (b) The raw set of straight
line segments generated during the manual editing of the curve outlining the region. (c)
A 3-D spline fit to the curve. (d) A region delineation computed using a gradient ascent
technique to find the best boundary with a strong edge that is near the manually entered
boundary.


The  Image  Understanding  Architecture 


Charles  C.  Weems,  Steven  P.  Levitan* 

Allen  R.  Hanson,  Edward  M.  Riseman 

Department  of  Computer  and  Information  Science 
University  of  Massachusetts 
Amherst,  MA  01003 


ABSTRACT 

This  paper  begins  with  the  motivation  and  rationale  for 
the design of the Image Understanding Architecture (IUA),
a  massively  parallel  system  for  supporting  real-time  image 
understanding  tasks  and  research  in  computer  vision.  A 
small-scale  demonstration  prototype  of  the  IUA  is  currently 
being  constructed  by  the  University  of  Massachusetts  and 
Hughes  Research  Laboratories.  The  remainder  of  this  pa¬ 
per  presents  an  overview  of  the  IUA  and  some  details  of 
the  architecture.  We  conclude  with  a  brief  discussion  of  the 
IUA  operating  and  software  development  environments. 

1.  INTRODUCTION 

Machine  vision  is  one  of  the  most  computationally  in¬ 
tractable  domains  of  artificial  intelligence  research.  It  re¬ 
quires  that  an  interpretation  of  a  changing  scene  be  updated 
with every new video frame: once every thirtieth of a second.
Each video frame contains three quarters of a million
color-intensity  data  values  which  comprise  the  picture  ele¬ 
ments  (pixels)  of  the  image.  Performing  a  single  operation 
on  each  of  these  pixels  requires  executing  about  23  million 
instructions  per  second  just  to  keep  up  with  the  input.  Of 
course,  the  computation  needed  to  perform  image  interpre¬ 
tation  is  much  more  than  one  operation  on  every  pixel  in 
an  image.  Some  researchers  believe  it  could  be  as  much  as 
6  or  7  orders  of  magnitude  more  computation. 
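
The throughput estimate in this paragraph follows from simple arithmetic, reproduced here as a short Python calculation. The variable names are chosen for the example; the figures are those quoted in the text.

```python
pixels_per_frame = 750_000       # "three quarters of a million" values per frame
frames_per_second = 30           # one new frame every thirtieth of a second

# One operation per pixel per frame, just to keep up with the input.
ops_per_second = pixels_per_frame * frames_per_second
print(ops_per_second)            # 22500000, i.e. about 23 million

# Full interpretation may need 6 or 7 orders of magnitude more computation.
interpretation_low = ops_per_second * 10**6
interpretation_high = ops_per_second * 10**7
```

On these figures, full interpretation at video rate would call for somewhere between roughly 2 x 10^13 and 2 x 10^14 operations per second.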

An  interpretation  is  a  high-level  description  of  the  en¬ 
vironment  from  which  the  image  was  taken.  It  must  be  in  a 
form  that  is  suitable  for  planning  such  diverse  activities  as 
robot  arm  and  hand  motion,  obstacle  avoidance  by  a  vehi¬ 
cle,  or  aircraft  navigation.  Automatic  scene  interpretation 
requires  the  construction  of  at  least  a  partial  description 
of  the  original  environment,  rather  than  a  description  of 
the  image  itself.  It  involves  not  only  labeling  certain  re¬ 
gions  in  an  image,  or  locating  a  single  object  in  the  viewed 


*  Current  address:  Department  of  Electrical  Engineer¬ 
ing,  Benedum  Hall,  University  of  Pittsburgh,  Pitts¬ 
burgh,  PA  15261 


scene,  but  often  requires  a  three  dimensional  model  of  the 
surroundings,  with  associated  identification  in  the  image  of 
the  2-dimensional  projections  of  these  3-dimensional  mod¬ 
els.  Figures  1  and  2  show  a  simple  example  of  an  interpre¬ 
tation  of  a  house  scene. 


Figure  1.  Raw  Image 


Interpretation  is  made  even  more  difficult  by  the  fact 
that the image data is inherently ambiguous and incomplete.
Let us consider a very difficult example, a window
viewed from the exterior of a house, in order to make our
point. One must expect that the interior might be visible
through a portion of the window; another portion, partially
superimposed on the first, might reflect part of the outdoor
scene, and might contain several specular reflections;
yet  another  part  of  the  window  could  appear  opaque;  fi¬ 
nally,  the  lower  portion  of  the  window  could  be  obscured 
by  shrubs.  This  example  is  a  situation  that  is  impossi¬ 
ble  to  interpret  with  pattern  recognition  techniques,  be¬ 
cause  the  window  does  not  have  a  unique  pattern.  It  is 
a collage of visual patterns, many of which are only partially
visible. While there has been a little success [Faugeras
and Price, 1981] in applying a Bayesian classification viewpoint
to some select subproblems in Computer Vision, there
are many difficulties and we believe standard statistical approaches
generally suffer from insoluble problems. Classical
pattern recognition techniques are not powerful enough
by themselves to produce effective classifications in the domains
we wish to consider.


Figure 2a. Example Image Interpretation

While  this  example  scene  might  be  beyond  the  current 
capabilities  of  computer  vision,  it  is  clear  that  correct  in¬ 
terpretation  of  the  window  image  would  require  that  we 
take  into  account  the  larger  context  of  the  overall  scene, 
and  knowledge  about  the  world  in  general.  Because  the 
ambiguous  data  is  part  of  a  house,  and  is  positioned  where 
a  window  might  appear,  it  might  be  possible  to  hypoth¬ 
esize  that  it  is  a  window.  Special  processes  may  then  be 
brought  to  bear  to  verify  this  hypothesis.  If  the  hypothe¬ 
sis  is  confirmed,  then  additional  knowledge  may  be  used  to 
fill  in  missing  data  and  use  weak  cues  to  infer  additional 
related  objects  consistent  with  this  context.  For  example, 
the  interpretation  may  state  that  the  lower  portion  of  the 
window,  although  obscured,  is  probably  rectangular. 

Thus,  we  are  inferring  the  presence  of  a  portion  of  the 
window from knowledge about houses and windows in general.
Inference via stored knowledge and the reduction of
ambiguity  from  “low-level”  image  analysis  are  a  part  of 
what  is  referred  to  as  “high-level”  vision  processing.  With¬ 
out  knowledge-based  processing  it  would  be  impossible  to 
interpret  large  portions  of  many  images.  The  approach  to 
knowledge-based vision that follows is derived from the VISIONS
system development project for the analysis of natural
scenes at the University of Massachusetts (UMass) [Hanson
and Riseman, 1974, 1978b, 1986; Parma et al., 1980;
Reynolds et al., 1984; Weymouth, 1986].

For high level interpretation, the principal unit of information
is a symbolic description of an object, or a set
of image events, sometimes referred to as image features or
symbolic tokens, extracted from the image. The description
includes relationships both to other 2 dimensional (2D)
symbolic tokens extracted from the sensory data, such as
lines, regions, and surfaces, and to other objects in the 3
dimensional scene being viewed. It also includes pointers
to elements of general knowledge that have been used to
support the interpretation process. At this level the representation
involves semantic concepts of “object-classes”,
their sub-classes, their “part-decomposition” and the various
types of relations between them.


Figure 2b. Labeling Key

Knowledge-based  processing,  however,  can  only  take 
place  after  a  certain  amount  of  low-level  processing  has  oc¬ 
curred.  Low  level  processing  principally  involves  classical 
image  processing  techniques  such  as  contrast  enhancement, 
and  computer  vision  techniques  of  segmentation  and  fea¬ 
ture  extraction,  edge  detection,  and  region  labeling.  At  the 
low  level,  the  principal  unit  of  information  that  is  being 
processed  is  the  pixel,  or  picture  element,  consisting  of  the 
color  or  intensity  values  of  the  image,  and  possibly  range 
data  for  the  visible  surface  element  associated  with  each 
pixel. 

There  is  no  simple  computational  transformation  that 
will  map  arrays  of  pixel  values  onto  the  stored  symbolic 
concepts  represented  in  the  high  level  knowledge  base.  It 
has become generally accepted [Hanson and Riseman, 1980]
that  many  levels  of  representation  (data  abstraction)  and 
many  stages  of  processing  must  take  place  to  reliably  in¬ 
terpret  a  scene.  We  will  refer  to  the  set  of  representations 
between  the  low  and  high  level  data  structures  as  the  in¬ 
termediate  representation. 

The  intermediate  level  of  representation  bridges  the  gap 
between  the  low  and  high  levels.  At  the  intermediate  level 
the  basic  unit  of  information  is  a  description  of  an  image 
event  extracted  from  the  image  data,  and  referred  to  as  a 
token.  Examples  of  token  classes  (or  types)  are  lines;  in¬ 
tensity,  color  or  texture  regions;  and  surfaces.  Processing 
at  the  intermediate  level  may  then  involve  grouping  these 




token  events  into  more  complicated  structures  such  as  rect¬ 
angles,  planes,  or  sets  of  parallel  lines. 
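
As an illustration of intermediate-level grouping, the sketch below collects line tokens into sets of roughly parallel lines by comparing their orientations. It is a minimal example under assumptions of our own: a line token is represented simply as a pair of endpoints, the 5-degree tolerance is arbitrary, and the greedy grouping strategy is chosen for brevity rather than taken from any system described here.

```python
import math

def orientation(line):
    """Undirected orientation of a line token, in degrees within [0, 180)."""
    (x1, y1), (x2, y2) = line
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def group_parallel(lines, tol_deg=5.0):
    """Greedily group line tokens whose orientations differ by at most tol_deg."""
    groups = []
    for line in lines:
        theta = orientation(line)
        for group in groups:
            diff = abs(theta - orientation(group[0]))
            diff = min(diff, 180.0 - diff)   # 179 degrees is close to 0 degrees
            if diff <= tol_deg:
                group.append(line)
                break
        else:
            groups.append([line])            # start a new set of parallel lines
    return groups

tokens = [
    ((0, 0), (10, 0)),      # horizontal
    ((0, 5), (10, 5.2)),    # nearly horizontal
    ((3, 0), (3, 8)),       # vertical
    ((10, 1), (0, 1)),      # horizontal, drawn right to left
]
groups = group_parallel(tokens)
```

Each resulting group is itself a new, more abstract token (a set of parallel lines) that later stages could combine further, for example into rectangles.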

The  intermediate  representation  is  treated  as  a  database 
by  the  high  level,  which  queries  it  in  order  to  form  initial 
hypotheses  about  the  content  of  the  image.  Following  this, 
verification of hypotheses and resolution of conflicting hypotheses
often require the high level to initiate further
processing  in  the  low  and  intermediate  levels  in  order  to 
extract  additional  information  from  the  image. 
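
The database view of the intermediate representation can be sketched as a collection of token records that the high level filters with predicates. Everything in the example below is hypothetical and chosen only for illustration: the `RegionToken` fields, the `query` helper, and the crude "sky" rule with its thresholds are assumptions, not part of the architecture described in this paper.

```python
from dataclasses import dataclass

@dataclass
class RegionToken:
    label: int
    mean_color: tuple    # (r, g, b), each 0-255
    centroid: tuple      # (row, col), with row 0 at the top of the image
    area: int            # size in pixels

def query(tokens, predicate):
    """Return the tokens for which the predicate holds."""
    return [t for t in tokens if predicate(t)]

def looks_like_sky(token, image_height):
    """Crude initial hypothesis: a large, bluish region near the top."""
    r, g, b = token.mean_color
    return (b > r and b > g
            and token.centroid[0] < image_height / 3
            and token.area > 500)

database = [
    RegionToken(1, (90, 120, 200), (40, 250), 12000),   # blue, high in image
    RegionToken(2, (60, 140, 70), (400, 200), 9000),    # green, low in image
    RegionToken(3, (100, 110, 190), (30, 480), 120),    # blue but tiny
]
sky_candidates = query(database, lambda t: looks_like_sky(t, 480))
```

A hypothesis formed this way is only a starting point; as the text notes, verifying it may require the high level to request further low-level and intermediate-level processing on the regions involved.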

Image  interpretation  may  thus  be  characterized  as  in¬ 
volving  three  different  levels  of  processing,  each  with  its 
own  specific  class  of  information.  Additionally,  those  lev¬ 
els  must  be  able  to  interact  through  bottom-up  transfers  of 
information  (Figure  3)  and  top-down  control  of  processing. 
The low, intermediate, and high levels of representation and
processing, together with our understanding of how those
levels interact, provide the basis for the design of our Image
Understanding Architecture (IUA), presented in section
3.

2.  PARALLEL  ARCHITECTURES  FOR 
KNOWLEDGE-BASED  VISION 

The  motivation  for  building  a  “vision  machine”  or  im¬ 
age  understanding  architecture  stems  from  the  need  to  sup¬ 
port  a  set  of  complex  operations  on  massive  amounts  of 
data  at  high  speeds,  and  provide  an  environment  for  exper¬ 
imentation  that  justifies  the  enormous  expense  in  software 
that  is  necessary  to  build  a  vision  system.  In  this  section 
we  discuss  the  requirements  for  such  an  architecture  and 
review  some  possible  architectures  and  organizations. 


2.1  Architectural  Requirements  for  Vision 

Some  of  the  architectural  issues  to  be  addressed  for  vi¬ 
sion  come  directly  from  the  specific  requirements  of  the 
problem: 

*  The  ability  to  process  both  pixel  and  symbol  data. 

*  A  fast  processing  rate  for  huge  amounts  of  sensory 
and  intermediate  level  data. 

*  The  ability  to  transform  an  image  into  a  set  of  mean¬ 
ingful  symbols  that  describe  it. 

*  The  ability  to  select  particular  subsets  of  data  for 
varying  types  of  processing. 

*  Feedback  mechanisms  that  allow  focusing  of  atten¬ 
tion  and  data-directed  processing,  without  having  to 
dump  the  image  to  some  “host”  for  external  evalua¬ 
tion. 

Beyond  these  requirements,  two  significant  architec¬ 
tural  implications  can  be  derived  from  an  understanding 
of  our  approach  to  the  computer  vision  problem: 

*  Multiple  levels  of  representation  and  stages  of  pro¬ 
cessing  are  essential  and  require  very  different  types 
of  processing  elements. 

* Fine grained and high speed communication and control
are required both among the processes at each level
and between the different processing levels.

We first discuss the computational requirements
at different processing levels and then move on to
the communication and control issues involved in vision
processing.


[Figure: Communications and Control Across Multiple Levels of Representation]

  High level: schema; symbolic descriptions of objects; control
    strategies; inference; propagation of hypotheses.

  Between high and intermediate levels: focus of attention; object
    matching and inference; rule-based object hypothesis; information
    fusion; grouping, splitting, adding & deleting regions, lines and
    surfaces.

  Intermediate level: symbolic description of regions, lines, surfaces;
    perceptual organization; grouping.

  Between intermediate and low levels: segmentation (feature
    extraction); goal-oriented re-segmentation (additional features,
    finer resolution).

  Low level: pixels; arrays of intensity, RGB, depth; stereo; motion.

Figure 3. UMass VISIONS Image Interpretation System




2.2  Processing  Characteristics  for  Multi-Level 
Architectures 

We  have  mapped  the  three  levels  of  abstraction  dis¬ 
cussed  above  into  three  levels  of  architectural  requirements 
with  each  level  having  the  appropriate  architecture  for  the 
set  of  tasks  at  that  level.  The  simplest  level  is  the  low-level 
where  uniform  computation  is  applied  at  local  points,  often 
at  each  pixel,  across  the  image.  At  the  low  level  we  need 
to  perform  operations  such  as  smoothing,  edge  detection, 
region  segmentation,  feature  extraction,  and  feature  match¬ 
ing  between  frames  in  motion  and  stereo  processing.  Here, 
each pixel, or a neighborhood around each pixel, must be
processed  and  there  are  large  amounts  of  local  processing 
to  be  applied  in  SIMD  fashion.  Some  of  these  operations 
may  also  require  processing  over  larger  image  distances. 

In  addition  to  high-speed  low  level  operations,  we  need 
to  be  able  to  quickly  load  the  data  and  to  test  the  results 
of partial processing. Performing multiple iterations of
low level operations under control of higher level processes
requires fast, concise feedback from the low level processes
to the intermediate and high level processes, as well as
data-dependent control of the low level processes by the higher
ones.

At  the  intermediate  level  our  concerns  are  for  the  sym¬ 
bolic  tokens  associated  with  extracted  image  events.  The 
types  of  operations  needed  at  this  level  are  partitioning 
and  merging,  which  transform  these  tokens  into  more  useful 
structures  in  conjunction  with  high  level  hypotheses  about 
the  possible  interpretations  of  events  in  the  image. 

The  intermediate  level  also  requires  specialized  proces¬ 
sors.  Consider  the  large  numbers  of  line  and  region  frag¬ 
ments  that  can  be  generated  by  even  the  most  effective 
low  level  algorithms.  To  perform  grouping  operations  (e.g., 
merging  and  splitting  of  regions,  or  linking  and  reorganiz¬ 
ing  lines)  we  need  a  large  amount  of  “less  local”  commu¬ 
nications.  We  need  to  match  fragments  of  lines  and  merge 
them  across  large  fractions  of  the  image.  Similarly,  regions 
need  to  be  merged  and  compared  with  others  from  possibly 
non-contiguous  areas. 

This  data  reduction  through  abstraction  process  is  the 
key  to  intermediate  level  processing.  For  it  to  be  done 
quickly  requires  architectural  support  for  both  inter-level 
and  intra-level  communication  as  well  as  a  flexible  data 
manipulation  repertoire  of  instructions.  Thus,  the  inter¬ 
mediate  level  must  operate  as  a  server  for  the  queries  made 
by  the  higher  level  processes,  support  data  reduction  pro¬ 
cesses,  and  provide  control  and  evaluation  mechanisms  for 
the  lower  level  processes. 

At  the  high  level  we  are  concerned  with  semantic  pro¬ 
cessing  involving  mechanisms  for  focus  of  attention,  the  for¬ 
mation  and  verification  of  object  hypotheses,  and  knowledge 
based  inference  using  complex  control  strategies  from  mul¬ 
tiple  knowledge  sources.  This  type  of  processing  involves 
extensive  symbolic  computation. 


High  level  operations  generate  and  test  hypotheses  based 
on  available  data  provided  by  the  low  and  intermediate  lev¬ 
els  of  processing  and  request  new  data  to  be  abstracted  from 
the  image  if  needed.  Many  hypotheses,  each  dependent  on 
the  results  of  applying  constraint  rules,  need  to  be  run, 
tried,  and  discarded  before  a  consistent  set  can  converge. 
The  computational  and  communication  needs  of  these  pro¬ 
cesses  should  be  provided  by  high  level  processors  which 
form  the  third  tier  of  processing  power  needed  to  solve  the 
image  understanding  problem. 

2.3  Communication  Between  Processing  Levels 

Central  to  vision  processing  is  the  bi-directional  flow  of 
communication  and  control  up  and  down  through  all  rep¬ 
resentation  and  processing  levels.  These  two  capabilities 
allow  us  to  perform  multiple  image  operations,  evaluate 
results  of  that  processing,  and  re-compute  with  different 
parameters  on  different  parts  of  the  image.  This  must  be 
done  in  tenths  of  milliseconds  in  order  for  different  inter¬ 
pretation  strategies  to  be  tried  dynamically  within  a  single 
frame  time. 

In  the  upward  direction,  the  communication  consists 
of  image  abstraction  and  segmentation  results  from  multi¬ 
ple  algorithms,  and  possibly  from  multiple  sensory  sources. 
It  also  involves  the  communication  of  a  set  of  attributes  of 
each  extracted  image  event  to  be  stored  in  a  symbolic  repre¬ 
sentation.  In  addition,  summary  information  and  statistics 
allow  processes  at  the  higher  levels  to  evaluate  the  success 
of  lower  level  operations.  In  the  downward  direction  the 
communication  consists  of  knowledge  directed  processing 
and  grouping  operations,  commands  for  selecting  subsets 
of  the  image  for  specifying  further  processing  in  particular 
portions  of  the  image,  modification  of  parameters  of  lower 
level  processes,  and  requests  for  additional  information  in 
terms  of  the  intermediate  representation. 

Communications  between  levels  may  take  four  possible 
forms:  One-to-one,  one-to-many,  many-to-one,  and  many- 
to-many.  The  first  form  involves  one  process  at  a  given  level 
communicating  directly  with  another  process  at  a  higher 
or  lower  level.  The  second  form  is  typically  a  process  that 
broadcasts  information  to  many  lower  level  processes.  The 
third  form  represents  a  collected  feedback  mechanism  in 
which information from many lower level processes is combined
and a result is passed to a single higher level process.
The last form involves the parallel transmission of information
between many processes at different levels. The
associative  processing  paradigm,  which  will  be  presented 
next,  provides  for  the  first  three  of  these  communications 
mechanisms.  The  remaining  mechanism  will  be  discussed 
at  the  end  of  this  section. 


2.3.1  Associative  Communications  and  Control 

Based on our experience with highly parallel
algorithms [Levitan, 1986], we believe that the best way
to meet the requirements of high speed, fine grained communications
and control is with associative processing techniques.
There are three processing capabilities that are key
to associative computation:

* Global Broadcast/Local Compare

*  Some/None  Response 

*  Count  Responders 

Associative processing can best be understood by an
example of a single controller (a teacher) interacting with
an associative array (a class of students) [Foster, 1976b].
If the teacher needs to know if any student in a class has
a copy of a particular book, the teacher can simply state,
“If you have the book, raise your hand.” The students each
make  a  check,  in  parallel,  and  respond  appropriately.  This 
corresponds  to  a  broadcast  operation  of  a  controller  and 
a  local  comparison  operation  at  each  pixel  in  an  array,  to 
check  for  a  particular  value.  Both  operations  assume  that 
the  local  processors  have  some  “intelligence”  to  perform  the 
comparison. 

Query  and  response  is  just  the  first  part  of  associative 
processing.  We  have  only  described  a  content  addressable 
(“If your hand is up, I’m talking to you.”) scheme. To
perform  associative  processing,  we  must  be  able  to  condi¬ 
tionally  generate  tags  based  on  the  value  of  data  and  use 
those  tags  for  further  processing.  An  example  of  this  kind 
of  association  would  be,  “If  the  intensity  of  the  red,  green 
and  blue  values  are  each  in  a  certain  range,  label  yourself 
POSSIBLE-SKY”;  certainly  not  a  robust  technique,  but  a 
colorful  example.  The  interesting  processing  comes  when 
the  controller  starts  performing  multiple  logical  operations 
on tags. Since each pixel (or region) could have multiple
tags based on properties of the data as well as things like spatial
coordinates, we could perform operations like: if you
are NOT-TREE and NOT-ROOF and (POSSIBLE-SKY or
POSSIBLE-CLOUD) and IN-TOP-OF-IMAGE then label
yourself LIKELY-SKY. As processing continues only subsets
of the pixels are involved in any particular operation.
We  are  selectively  processing  pieces  of  the  image  based  on 
their  properties,  but  we  are  operating  on  all  pixels  with  a 
given  set  of  properties,  in  parallel. 
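
The tagging example can be sketched in a few lines of Python. This is only an illustrative model, not IUA code: the cell fields, the numeric color ranges, the row threshold, and the `broadcast` helper are all assumptions made for the example, and NOT-TREE and NOT-ROOF are modeled simply as the absence of a TREE or ROOF tag. The loop over cells stands in for what the array would do in a single parallel step.

```python
# Illustrative model of broadcast / local compare with tags.
# Each dict is one cell; the loop in broadcast() stands in for the
# array-wide parallel step. All names and thresholds are hypothetical.

def broadcast(cells, predicate, tag):
    """Every cell evaluates `predicate` locally and sets `tag` if it holds."""
    for cell in cells:
        if predicate(cell):
            cell["tags"].add(tag)

def make_cell(r, g, b, row, tags=()):
    return {"r": r, "g": g, "b": b, "row": row, "tags": set(tags)}

cells = [
    make_cell(90, 140, 230, row=5),                 # bluish, near the top
    make_cell(95, 150, 235, row=400),               # bluish, near the bottom
    make_cell(60, 120, 50, row=10, tags={"TREE"}),  # already labeled TREE
]

# "If your red, green and blue values are each in a certain range,
#  label yourself POSSIBLE-SKY."
broadcast(
    cells,
    lambda c: 50 <= c["r"] <= 120 and 100 <= c["g"] <= 180 and c["b"] >= 200,
    "POSSIBLE-SKY",
)

# A spatial tag derived from coordinates rather than pixel values.
broadcast(cells, lambda c: c["row"] < 128, "IN-TOP-OF-IMAGE")

def likely_sky(c):
    """Logical combination of tags that are already present in the cell."""
    t = c["tags"]
    return ("TREE" not in t and "ROOF" not in t
            and ("POSSIBLE-SKY" in t or "POSSIBLE-CLOUD" in t)
            and "IN-TOP-OF-IMAGE" in t)

broadcast(cells, likely_sky, "LIKELY-SKY")

# Indices of the cells that ended up tagged LIKELY-SKY.
likely = [i for i, c in enumerate(cells) if "LIKELY-SKY" in c["tags"]]
```

Each call to `broadcast` touches only the cells whose local test succeeds, which mirrors the point that later steps operate on progressively smaller subsets of the array.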

The  ability  to  associate  tags  with  values  is  half  the  bat¬ 
tle  for  high  speed  control.  We  also  need  to  get  responses 
back from the array quickly. Forcing the teacher to ask
each student if they have their hand up defeats the process.
The teacher can see immediately if any of the students
have their hands up, and can quickly count how many do.
Similarly, a Some-response/No-response (Some/None) wire
running through the pixel array allows the controller to immediately
determine properties about the data in the array,
and  therefore  the  state  of  processing  in  the  array  without 
looking  sequentially  at  the  data  values  themselves. 

Additionally,  fast  hardware  to  perform  a  count  of  the 
responders  allows  the  controller  to  see  summary  informa¬ 
tion  about  the  state  of  the  data  in  the  array.  We  can  write 
programs that can conditionally perform operations based
on the state of the computation. By using the properties
of the radix representation of numeric values in the array
we can use the counting hardware to sum the values in
the array. The ability to sum values gives us the power to
compute statistical measures such as mean and variance.
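
The radix argument can be made concrete. If every cell responds with bit k of its value, the number of responders multiplied by 2^k is the contribution of that bit position to the total, so one count per bit plane yields the sum. The sketch below assumes unsigned integer values; `count_responders` is a hypothetical stand-in for the counting hardware, and the variance computed is the population variance.

```python
def count_responders(values, bit):
    """Model of the counting hardware: how many cells have `bit` set."""
    return sum((v >> bit) & 1 for v in values)

def associative_sum(values, width):
    """Sum unsigned `width`-bit values using one count per bit plane."""
    return sum(count_responders(values, k) << k for k in range(width))

def mean_and_variance(values, width):
    """Population mean and variance built only from associative sums."""
    n = len(values)
    total = associative_sum(values, width)
    # Squares are formed locally in each cell, then summed the same way.
    squares = [v * v for v in values]
    total_sq = associative_sum(squares, 2 * width)
    mean = total / n
    return mean, total_sq / n - mean * mean

pixels = [12, 200, 37, 37, 255, 0, 91, 128]
total = associative_sum(pixels, width=8)
mean, variance = mean_and_variance(pixels, width=8)
```

The cost is proportional to the field width rather than to the number of cells, which is what makes the operation attractive on a large array.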

These  examples  of  students  and  pixels  illustrate  the 
power  of  associative  processing.  We  use  associative  pro¬ 
cessing  as  our  paradigm  of  communications  in  the  upwards 
direction  and  control  in  the  downwards  direction  between 
each  pair  of  levels  in  the  hierarchy.  We  broadcast  criteria 
for  selecting  pixels,  or  regions,  or  symbolic  tokens  for  se¬ 
lective  processing.  In  this  way  higher  levels  control  lower 
levels.  We  test  and/or  count  the  response  that  comes  after 
processing  data  to  allow  conditional  branching  for  the  next 
step  of  processing  in  a  given  algorithm.  Thus  the  lower 
levels  provide  feedback  to  higher  ones. 

The  associative  select  operation  also  provides  one  mech¬ 
anism  for  transferring  non-summary  information  from  a 
lower  to  a  higher  level  in  the  vision  processing  hierarchy. 
The  higher  level  may  select  a  single  value  in  a  lower  level 
and  read  it  out.  This  transfer  of  information  is  one-to- 
one:  a  single  value  at  a  lower  level  is  copied  to  a  higher 
level. 


2.3.2  Parallel  Data  Transfer  Between  Levels 

Associative  select  is  useful  when  a  small  number  of  data 
values  must  be  copied  to  a  higher  level.  When  a  large  num¬ 
ber  of  values  must  be  transferred  between  levels  a  many- 
to-many  data  path  between  adjacent  levels  is  more  appro¬ 
priate.  A  simple  example  is  a  data  path  that  connects 
spatially  collocated  processors  in  adjacent  pairs  of  levels. 
These  processors  may  then  perform  inter-level  transfers  in 
parallel. 

The  many-to-many  communications  mechanism  permits 
the  use  of  a  programming  paradigm  in  which  each  level  is 
used  to  transform  a  lower  level  representation  into  a  higher 
level  representation.  All  or  part  of  this  new  representation 
may  then  be  extracted  by  the  next  higher  level  of  process¬ 
ing.  Each  level  may  then  treat  the  level  below  it  as  an 
associative  data  base. 

If  the  many-to-many  communications  mechanism  also 
permits  the  passing  of  control  information  from  higher  to 
lower  levels,  then  the  granularity  of  control  can  be  made  to 
vary.  This  would  allow,  for  example,  initial  processing  to 
take place in a coarse-grained control mode (SIMD), and
later  processing  to  take  place  in  a  fine-grained  control  mode 
(Multi-SIMD,  or  MIMD).  The  former  is  useful  for  generat¬ 
ing  initial  hypotheses  about  an  image,  while  the  latter  is 
better  suited  to  resolving  multiple  local  conflicts  between 
those  hypotheses,  and  to  filling  in  details  of  the  interpreta¬ 
tions. 




3.  THE  IMAGE  UNDERSTANDING 
ARCHITECTURE (IUA)

The  University  of  Massachusetts  (UMass)  together  with 
Hughes  Research  Laboratories  (HRL)  is  developing  a  three- 
level  tightly-coupled  associative  architecture  to  encompass 
all levels of vision processing. The Image Understanding
Architecture provides an integrated approach to the three
types of computation outlined above, including the critical problems
of communication between the three levels.


3.1  Architecture  Overview 

The  Image  Understanding  Architecture  represents  a 
hardware  implementation  of  the  three  levels  of  abstraction 
in  our  view  of  computer  vision.  Overall  it  consists  of  three 
different,  closely  coupled  parallel  processors.  These  are  the 
Content  Addressable  Array  Parallel  Processor  (CAAPP)* 
at the low level, the Intermediate and Communications Associative
Processor (ICAP) at the intermediate level, and the
Symbolic  Processing  Array  (SPA)  at  the  high  level  (figure 
4). 


Image Understanding Architecture

High level: Symbolic Processing Array (SPA)
  • 64 LISP 32-bit processors (MIMD)
  • Instantiation of schema strategies
  • Construction of scene interpretation

    Top-down parallel associative communication and MIMD control

Intermediate level: Intermediate and Communications Associative
Processor (ICAP)
  • 64 x 64 array (Synchronous-MIMD) of 16-bit processors
  • 64 parallel local 8 x 8 Multi-SIMD arrays
  • Executes grouping process
  • Stores intermediate symbolic representation

    Parallel associative communication and Multi-SIMD control

Low level: Content Addressable Array Parallel Processor (CAAPP)
  • 512 x 512 SIMD array of 1-bit (serial) ALUs
  • Custom VLSI chips
  • Stores sensory data
  • Executes low-level and segmentation algorithms

    Sensory data


Figure 4. IUA Overview


* The term “content-addressable” is a synonym for “associative”
and is an alternate term that now is not as
widely used as it was when some of our work began
[Weems, 1984a].


The three levels of the IUA would at first appear to
constitute a multi-resolution pyramid architecture [Uhr,
1972; Tanimoto, 1983; Hanson and Riseman, 1980]. While
it is possible to use the IUA to implement algorithms for
a multi-resolution pyramid, this is not the use for which
it is envisioned. Rather, the IUA implements a hierarchy
of abstraction which corresponds to the three levels of
abstraction (representation and processing) in the UMass
VISIONS system.

Architecturally,  it  is  easy  to  distinguish  the  differences 
between  the  IUA  and  a  pyramid  processor.  In  a  multi¬ 
resolution  pyramid,  each  higher  layer  is  a  power  of  two 
narrower  than  the  layer  below  it,  and  all  of  the  process¬ 
ing  elements  in  the  pyramid  are  identical,  with  each  layer 
being  treated  as  a  SIMD  parallel  processor.  In  the  IUA, 
however,  the  layer  widths  are  512,  64,  and  8  -  forming  a 
sparse  pyramid.  Most  importantly  the  processing  elements 
in  the  IUA  differ  greatly  from  layer  to  layer.  In  each  layer 
of  the  IUA  the  processing  elements  are  tuned  to  the  specific 
class  of  tasks  required  by  that  particular  level  of  abstrac¬ 
tion.  For  example,  it  is  inappropriate  to  try  to  run  LISP 
in  parallel  at  the  lowest  level,  because  the  low  level  is  pri¬ 
marily  concerned  with  fast  pixel  operations  that  will  build 
an  intermediate  level  symbolic  representation.  Thus,  the 
low  level  processors  are  tuned  for  real-time  image  process¬ 
ing  operations.  At  the  highest  level,  on  the  other  hand, 
symbolic  AI  processing  will  be  the  main  objective,  so  the 
high  level  processors  are  tuned  to  run  LISP  code  efficiently. 
This  leads  to  a  significant  physical  difference  between  the 
multi-resolution  pyramid  and  the  IUA:  whereas  the  amount 
of  circuitry  in  a  pyramid  decreases  by  a  factor  of  four  with 
each  higher  layer,  the  amount  of  circuitry  in  each  layer  of 
the  IUA  is  roughly  constant. 

Another  important  difference  between  a  multi-resolution 
pyramid  and  the  IUA  is  that  at  the  high  level,  the  IUA  is 
purely  an  MIMD  parallel  processor.  Additionally,  the  inter¬ 
mediate  and  low  levels  of  the  IUA  may  be  treated  in  a  vari¬ 
ety  of  modes.  These  include  the  CAAPP  operating  in  pure 
SIMD  or  Multi-SIMD  mode,  and  the  ICAP  operating  in 
synchronous-MIMD or pure MIMD mode. In Multi-SIMD
mode, the CAAPP cells execute in disjoint SIMD groups,
with each group receiving a different instruction stream.
In synchronous-MIMD mode, the programming paradigm
is more like SIMD: the ICAP processors execute similar instruction
streams, and globally synchronize for each stage
of processing. Synchronous-MIMD has the advantage of
being  as  simple  to  program  as  a  SIMD  system  but  without 
the  time  penalty  of  having  to  sequentialize  on  branching 
structures.  The  various  modes  of  parallelism  in  the  IUA 
are  provided  to  allow  multiple  hypotheses  from  the  SPA  to 
be  evaluated  in  parallel  at  the  lower  levels. 
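
The difference between pure SIMD and Multi-SIMD control can be illustrated with a toy model. The sketch below is an assumption-laden simplification: cells are list entries, an "instruction" is a Python function, and `group_of` and `streams` are hypothetical names for the partition of cells into disjoint groups and the instruction broadcast to each group.

```python
def simd_step(values, instruction):
    """Pure SIMD: every cell applies the same broadcast instruction."""
    return [instruction(v) for v in values]

def multi_simd_step(values, group_of, streams):
    """Multi-SIMD: each cell applies the instruction sent to its own group."""
    return [streams[group_of[i]](v) for i, v in enumerate(values)]

values = [1, 2, 3, 4]
group_of = [0, 0, 1, 1]            # two disjoint groups of cells
streams = {
    0: lambda v: v + 10,           # instruction broadcast to group 0
    1: lambda v: v * 2,            # instruction broadcast to group 1
}

same_for_all = simd_step(values, lambda v: v + 10)
per_group = multi_simd_step(values, group_of, streams)
```

Pure SIMD is the special case in which all cells belong to a single group. Evaluating two competing hypotheses on different parts of the image corresponds to giving each region its own group and its own stream.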

3.2  The  Three  Processing  Levels 

The  CAAPP  is  a  512  x  512  square  grid  array  of  1-bit 
serial processors intended to perform low-level image processing
tasks. It is similar to the NASA/Goddard MPP
[Batcher, 1980] but with an architecture that is especially
oriented  towards  associative  processing  with  global  sum¬ 
mary  feedback  mechanisms.  This  reflects  the  difference  in 
application  domain  and  processing  strategy  for  these  two 
machines.  For  example,  a  typical  operation  on  the  MPP 
involves  enhancement  of  a  large  satellite  image,  where  the 
results  of  the  enhancement  are  presented  to  a  human  op¬ 
erator  who  then  decides  what  further  processing  will  be 
required.  The  goal  of  our  work,  on  the  other  hand,  is  auto¬ 
mated  real-time  vision  without  human  intervention,  where 
all  of  the  processing  and  interpretation  must  be  done  by 
the  system  itself.  Thus,  the  CAAPP  has  been  tailored 
to  permit  flexible  control,  and  to  provide  feedback  to  the 
controlling  processes  so  that  they  may  exercise  control  in 
response  to  actual  image  properties. 

The  intermediate  level  is  implemented  by  the  Interme¬ 
diate  and  Communications  Associative  Processor  (ICAP). 
The  ICAP  is  also  a  square  grid  associative  array,  of  more 
powerful  processing  elements;  the  ICAP  is  a  64  by  64  array 
of  16-bit  processors.  Each  ICAP  cell  is  associated  with  an 
8  by  8  tile  of  CAAPP  cells,  to  which  it  has  access. 
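
The correspondence between the two arrays follows from their sizes: 512 / 64 = 8 CAAPP cells per ICAP cell along each side. A minimal sketch of the index arithmetic, assuming (this is not stated in the text) that tiles are aligned to the grid and that cells are addressed by zero-based (row, column) pairs; the function names are illustrative.

```python
CAAPP_SIZE = 512                   # CAAPP cells per side
ICAP_SIZE = 64                     # ICAP cells per side
TILE = CAAPP_SIZE // ICAP_SIZE     # CAAPP cells per tile side

def icap_for(caapp_row, caapp_col):
    """ICAP cell that owns a given CAAPP cell (aligned-tile assumption)."""
    return caapp_row // TILE, caapp_col // TILE

def tile_of(icap_row, icap_col):
    """Coordinates of all CAAPP cells in the tile owned by one ICAP cell."""
    r0, c0 = icap_row * TILE, icap_col * TILE
    return [(r, c)
            for r in range(r0, r0 + TILE)
            for c in range(c0, c0 + TILE)]
```

For example, `icap_for(511, 0)` gives the ICAP cell in the last row and first column, and `tile_of` returns the 64 CAAPP coordinates that map back to a given ICAP cell.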

The  ICAP  is  designed  to  manipulate  the  tokens  in  the 
Intermediate  Symbolic  Representation  and  to  act  as  a  data 
base  for  queries  by  processing  elements  in  the  SPA.  For  ex¬ 
ample,  a  house-roof  schema  in  the  high  level  may  direct  the 
ICAP  to  group  together  long,  straight,  parallel  lines.  The 
schema  may  then  direct  the  ICAP  to  extract  parallelograms 
that  are  candidate  roof  outlines. 

A  typical  ICAP  level  symbolic  representation  of  a  line 
would consist of a unique label for the line and a set of fields
which  quantify  its  attributes.  Such  attributes  may  include 
end  points,  orientation,  contour  length,  relative  curvature, 
direction  of  curvature,  average  contrast  across  the  line,  la¬ 
bels  of  adjacent  regions,  nearby  endpoints,  and  pointers  to 
nearby  or  related  lines.  Clearly,  while  the  bit-serial  pro¬ 
cessors  of  the  CAAPP  are  well  suited  for  developing  this 
representation,  they  will  be  inappropriate  for  manipulating 
it.  Thus,  the  CAAPP  is  used  to  develop  the  intermediate 
symbolic  representation,  which  is  then  passed  to  the  ICAP 
for  further  processing.  Should  the  need  arise,  the  results 
of  re-segmentation  in  the  CAAPP  can  be  integrated  with 
the  representation  in  the  ICAP.  To  facilitate  this  processing 
capability,  the  ICAP  representation  is  kept  in  approximate 
registration  with  the  original  image  events  in  the  CAAPP. 

The  SPA  processors  are  powerful,  general  purpose  mi¬ 
croprocessors  intended  for  performing  high-level  symbolic 
operations,  and  for  controlling  sub-array  processing  in  the 
ICAP  and  CAAPP  arrays.  To  the  SPA,  the  lower  levels 
appear  as  an  intelligent  database  that  is  part  of  a  shared 
global  memory.  The  shared  memory  decouples  the  SPA 
processes  from  the  locality  of  information  in  the  image.  An 
SPA  process  simply  makes  a  request  to  the  database  and 
then  waits  for  completion  of  the  request  before  accessing 
the database to get the results.




The  SPA  processors  will  run  a  LISP-based  blackboard 
system  in  which  various  schemas  will  cooperate  and  com¬ 
pete  in  the  generation  and  verification  of  hypotheses  about 
the  content  of  the  image  and  its  relationship  to  models  of 
the environment. From the point of view of the blackboard
system, the CAAPP and ICAP will appear as knowledge
sources at different levels of abstraction. The various
schemas  in  the  system  will  activate  different  processes  in 
the  CAAPP  and  ICAP  for  either  the  full  array  or  indepen¬ 
dent  sub-arrays. 

The  IUA  is  conceived  as  a  stand-alone  image  interpre¬ 
tation  system.  Although  a  host  processor  is  attached  to  the 
global  controller  for  the  IUA,  the  host  is  intended  to  serve 
the  IUA  rather  than  the  other  way  around.  The  IUA  host 
will  provide  a  software  development  environment,  and  an 
access  point  for  users  of  the  IUA.  The  host  may  be  used 
for  examining  the  results  of  processing  on  the  IUA,  or  for 
monitoring  processes  in  the  IUA.  However,  the  host  system 
does  not  take  part  in  the  actual  image  interpretation  pro¬ 
cess.  The  dedicated  global  controller,  which  is  an  integral 
part  of  the  IUA  system,  is  responsible  for  managing  the 
CAAPP  and  ICAP  arrays  in  cooperation  with  the  SPA. 


3.3  Architecture  Details 

3.3.1  The  CAAPP:  Low  Level  Processing 

The CAAPP consists of a 512 by 512 array of bit-serial
processing elements that are linked through a four way
(S, E, W, N) communications grid that is augmented with a
proprietary circuit to allow certain types of long distance
communication to take place quickly. Each processor will
contain 320 bits of RAM, 5 one-bit registers (A, B, X, Y,
Z), an ALU and data routing circuitry. Each element also
has access to a 32K-bit backing store memory that is dual-ported
with the ICAP. These processing elements are very
simple, but it is this simplicity that allows us to place 64 of
them on a single integrated circuit; a density of processing
elements that is a prerequisite for constructing a machine
with this many processors. It is this simplicity that also
permits the CAAPP to execute instructions with a cycle
time of 100 nanoseconds. The initial development of this
design is described in great detail in [Weems, 1984a]. Much
work has gone into improvements to the CAAPP processing
element since that time [Weems, 1984b, 1985].
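
The scale implied by these figures can be worked out directly. The short calculation below uses only numbers given in this section (a 512 by 512 array, 320 bits of RAM per processor, a 100 nanosecond cycle, and the 64 processing element chip); the variable names are illustrative, and the derived totals are arithmetic consequences rather than figures quoted by the design documents.

```python
ARRAY_SIDE = 512          # CAAPP cells per side
CELLS_PER_CHIP = 64       # processing elements on one chip
RAM_BITS_PER_CELL = 320   # on-chip RAM per processing element
CYCLE_NS = 100            # instruction cycle time in nanoseconds

cells = ARRAY_SIDE * ARRAY_SIDE                          # 262,144 cells
chips = cells // CELLS_PER_CHIP                          # 4,096 chips
ram_bits_per_chip = RAM_BITS_PER_CELL * CELLS_PER_CHIP   # 20,480 bits
total_ram_bytes = cells * RAM_BITS_PER_CELL // 8         # 10,485,760 bytes
instructions_per_second = 10**9 // CYCLE_NS              # 10,000,000 per cell
```

The chip count makes clear why packing many processing elements onto one integrated circuit is a prerequisite for building the array at all.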

The  key  to  integrating  the  CAAPP  into  the  IUA  is 
its combination of associative feedback and control mechanisms.
The principal feedback mechanism in the CAAPP is
the  array-wide  logical  OR  output,  called  Some/None,  which 
indicates  whether  any  CAAPP  cells  are  in  a  given  state.  At 
the  end  of  each  instruction  cycle  the  logical  OR  of  the  re¬ 
sponse  bit  from  every  processing  element  is  automatically 
available  at  two  different  levels.  The  global  controller  re¬ 
ceives  this  signal  for  the  full  array,  while  the  ICAP  proces¬ 
sors  receive  the  Some/None  indication  for  only  the  portion 
of  the  CAAPP  array  associated  with  them. 


A  count  of  all  responding  cells  is  also  available  at  the 
global  controller  and  ICAP  level.  The  counting  operation  is 
used  to  gather  statistics  about  an  image  and  the  results  of 
processing.  For  example,  through  counting  we  may  quickly 
determine  the  mean  and  standard  deviation  of  an  attribute 
for  a  given  set  of  regions.  The  corresponding  sub-array 
counts  are  available  at  the  ICAP  at  the  end  of  each  CAAPP 
instruction  cycle.  The  global  controller  develops  the  full 
count  through  a  polling  mechanism  that  takes  16  CAAPP 
instruction  times  to  complete.  However,  the  polling  opera¬ 
tion  is  independent  of  processing  in  the  CAAPP  cells  and 
so  the  CAAPP  may  continue  to  operate  while  a  count  is 
being  formed. 
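
The two feedback signals, Some/None and the responder count, and the two levels at which they are available can be modeled as follows. This is a sketch under stated assumptions: the response bits are held in a square list of lists, tiles are aligned 8 by 8 blocks, and the global controller's polling mechanism is modeled simply by combining the per-tile results, since the text does not describe how the poll is carried out. The function names are hypothetical.

```python
TILE = 8  # CAAPP cells per side of the tile seen by one ICAP processor

def tile_cells(response, tile_row, tile_col):
    """Response bits of the CAAPP cells under one ICAP processor."""
    r0, c0 = tile_row * TILE, tile_col * TILE
    return [response[r][c]
            for r in range(r0, r0 + TILE)
            for c in range(c0, c0 + TILE)]

def tile_feedback(response, tile_row, tile_col):
    """(Some/None, count) as seen by one ICAP processor."""
    bits = tile_cells(response, tile_row, tile_col)
    return any(bits), sum(bits)

def global_feedback(response):
    """(Some/None, count) for the full array, combined from all tiles."""
    tiles = len(response) // TILE
    per_tile = [tile_feedback(response, tr, tc)
                for tr in range(tiles) for tc in range(tiles)]
    return any(s for s, _ in per_tile), sum(n for _, n in per_tile)

# Toy 16 x 16 array (2 x 2 tiles) with three responders.
response = [[False] * 16 for _ in range(16)]
response[1][2] = response[3][4] = True   # both in tile (0, 0)
response[9][12] = True                   # in tile (1, 1)
```

An ICAP processor whose tile reports None can skip further work on that tile, while the global controller sees only whether anything in the whole array responded and how many cells did.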

The  Select-First  operation  is  used  to  isolate  a  single 
CAAPP  cell  for  readout  or  processing  by  the  global  con¬ 
troller,  SPA,  or  ICAP  processors.  The  primary  purpose  is 
to  select  single  identified  cells  as  representatives  of  groups 
(regions  or  segments)  which  can  then  be  used  to  store  facts 
about  all  cells  in  that  group.  For  instance  one  cell  in  a 
region  might  keep  the  average  color  intensity  and  variance 
for  the  entire  region.  As  necessary,  the  data  in  these  cells 
can  be  moved  up  to  the  ICAP  level. 
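
A software analogue of Select-First is shown below. The text does not say which responder counts as "first", so the sketch assumes the lowest index in a linear ordering of the cells; `select_first` and `region_representatives` are illustrative names, and the region labels are made-up data.

```python
def select_first(response):
    """Return a mask in which only the first responder remains set.

    "First" is assumed to mean the lowest index in a linear ordering.
    """
    mask = [False] * len(response)
    for i, bit in enumerate(response):
        if bit:
            mask[i] = True
            break
    return mask

def region_representatives(labels):
    """Pick one representative cell index for each region label."""
    reps = {}
    for label in sorted(set(labels)):
        # Broadcast the label; each cell compares it with its own.
        response = [own == label for own in labels]
        reps[label] = select_first(response).index(True)
    return reps

labels = [2, 2, 7, 2, 7, 5]     # region label held by each cell
reps = region_representatives(labels)
```

The representative cell for a region is then a natural place to store per-region facts, such as the average intensity, before they are moved up to the ICAP.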

The principal mechanism for transferring data between
the CAAPP and ICAP is a shared memory structure. Besides
the normal program and data memory, the ICAP has a
256K byte block of memory that is shared with the CAAPP.
This 256K acts as a 32K-bit by 64-bit backing store for
the on-chip CAAPP memory and is the primary communications
path between CAAPP and ICAP. Swapping to
and from the backing store is done through dual-porting of
a portion of the on-chip CAAPP memory. Although access
to the off-chip memory is about 10 times slower for
the CAAPP than access to the on-chip memory, the dual-porting
arrangement permits double-buffered swapping to
take place while the CAAPP is processing data in other
memory segments.
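
The memory sizes quoted for the shared block are mutually consistent, as a short calculation shows. Two assumptions are made here that the text does not state outright: that K denotes 1024, and that the 64-bit width of the backing store corresponds to one bit for each of the 64 CAAPP cells in an ICAP processor's 8 by 8 tile.

```python
K = 1024
CELLS_PER_ICAP = 8 * 8             # CAAPP cells in one ICAP tile
BACKING_BITS_PER_CELL = 32 * K     # backing store per CAAPP cell
WORD_BITS = 64                     # assumed: one bit per cell in the tile

shared_bits = BACKING_BITS_PER_CELL * CELLS_PER_ICAP
shared_bytes = shared_bits // 8    # size of the shared block in bytes
words = shared_bits // WORD_BITS   # depth of the store in 64-bit words
```

Under these assumptions `shared_bytes` equals 256K and `words` equals 32K, matching the 256K byte block and the 32K-bit depth per cell given in the text.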

In  the  event  that  large  amounts  of  CAAPP  data  must 
be  moved  up  to  the  SPA  or  controller,  the  data  will  first 
be transferred in parallel to the ICAP processors, which are
also  linked  to  the  SPA  processors  through  a  dual  ported 
common  memory. 

The  current  design  of  the  CAAPP  processing  elements 
has been achieved through four iterations of reduced instruction
set analysis and redesign of the processing element
architecture [Weems, 1985]. The result is a design that is
60  percent  faster  than  the  original  architecture,  with  an 
overall  decrease  in  the  device  count  for  the  processors  and 
control  circuitry.  Figure  5  shows  the  CAAPP  cell  archi¬ 
tecture,  and  figure  6  presents  the  instruction  set  for  the 
CAAPP. 

We are currently preparing a CMOS implementation of
the CAAPP chip design with the cooperation of Hughes
Research Laboratories. The 64 processing element chip
is estimated to contain approximately 60,000 devices, of
which 80 percent are in the on-chip memory. This is roughly
the same number of devices found in 16-bit microprocessor
chips, but is much simpler to design and implement because
of the repetitive nature of the parallel processor cell design.
It is essential that the primary working data memory be
on the same chip as the processing elements, in order to
maximize access speed, and to keep the number of pins on
the chip within reasonable limits.

Figure 5. CAAPP Cell Architecture

The  CAAPP  algorithm  in  figure  7  demonstrates  how 
the  CAAPP  is  programmed,  and  the  importance  of  rapid 
feedback  from  the  CAAPP  to  its  controller.  This  algorithm 
selects  all  cells  that  contain  the  maximum  value  in  a  given 
field.  In  addition,  the  maximum  value  is  available  in  the 
controller  at  the  end  of  this  operation.  Selecting  maximum 
values  requires  three  CAAPP  instruction  cycles  for  each  bit 
in  the  field. 

The  algorithm  begins  by  loading  the  high  order  bit  of  a 
field  into  the  response  register  of  all  active  cells.  The  global 
controller  then  tests  the  Some/None  output  of  the  array.  If 
any  cells  have  their  high  order  bit  set,  then  they  are  can¬ 
didates  for  the  maximum  value.  Any  cells  that  have  a  zero 
in their high order bit are then deactivated. However, if no
cells have their high order bit set, then none are deactivated
because  they  are  all  still  potential  candidates.  This  process 
repeats  with  each  successively  lower  order  bit  in  the  field. 
When  the  low  order  bit  has  been  processed,  only  those  cells 
which  contain  the  maximum  value  will  remain  active.  For 
each  iteration,  the  controller  saves  the  Some/None  response 
so  that  the  maximum  value  is  available  in  the  controller  at 
the  conclusion  of  processing. 
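The bit-serial maximum search described above can be sketched in Python as follows. The sketch models the array as a list of integers and the activity and response registers as lists of booleans; the function name and data layout are illustrative assumptions, and the sequential loops stand in for operations the CAAPP performs in all cells at once.

```python
def find_maximum(values, field_length):
    """Bit-serial maximum search driven by Some/None feedback.

    Returns the final activity flags (True for cells holding the maximum)
    and the maximum value as reconstructed by the controller.
    """
    activity = [True] * len(values)   # all cells start as candidates
    maximum = 0
    for bit_num in range(field_length - 1, -1, -1):
        # Each active cell loads the current bit into its response register.
        response = [active and bool((value >> bit_num) & 1)
                    for value, active in zip(values, activity)]
        some = any(response)          # global Some/None test
        if some:
            # Cells with a zero in this bit can no longer hold the maximum.
            activity = response
        # The controller records Some/None as the next bit of the maximum.
        maximum = (maximum << 1) | int(some)
    return activity, maximum

activity, maximum = find_maximum([5, 12, 7, 12, 3], field_length=4)
```

Both cells holding 12 remain active at the end, and the controller has assembled the value 12 from the sequence of Some/None responses without reading any cell directly.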

3.3.2  The  ICAP:  Intermediate  Level  Processing 

The  ICAP  is  also  a  square  grid  array  processor,  and  is 
intended  to  perform  intermediate  level  processing  on  image 
events  that  are  in  physical  registration  with  the  pixel  data. 
Each  ICAP  processing  element  is  associated  with  an  8  by 
8  tile  of  processors  in  the  CAAPP  array.  An  ICAP  pro¬ 
cessor  has  access  to  data  stored  in  any  of  the  64  CAAPP 
processors  it  is  associated  with.  Each  ICAP  processor  also 
has  access  to  the  global  summary  information  for  those  64 
processors. 
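The association between CAAPP cells and ICAP processors can be pictured with a small Python sketch. It assumes the 8 by 8 tiles are aligned to the array origin and addressed by (row, column) pairs; the text states only that each ICAP processor serves one 8 by 8 tile, so the coordinate convention and function names are hypothetical.

```python
TILE = 8  # each ICAP processor is associated with an 8 by 8 tile of CAAPP cells

def icap_for_cell(row, col):
    """Return the (row, col) of the ICAP processor serving a CAAPP cell."""
    return row // TILE, col // TILE

def cells_for_icap(icap_row, icap_col):
    """Return the 64 CAAPP cell coordinates served by one ICAP processor."""
    return [(icap_row * TILE + r, icap_col * TILE + c)
            for r in range(TILE) for c in range(TILE)]

corner = icap_for_cell(511, 511)
tile = cells_for_icap(3, 5)
```

With a 512 by 512 CAAPP, the last cell maps to ICAP processor (63, 63), consistent with a 64 by 64 ICAP array.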

[Instruction word layout, bits 31 to 0. Field boundaries are marked at
bits 31, 27, 23, 18, 13, 9, 8, 5, and 0; the field labels are not reliably
legible in the source.]

INH (Inhibit)
  0   Non-inhibit: always active
  1   Inhibit if A = 0
  2   Inhibit if A = 0 or S/N = Some
  3   Inhibit if A = 0 or S/N = None

[Field name illegible in the source]
  0   TRUE
  1   Complement [operand illegible] Result

Code    Si          Sj          Ftn                     Dest
0       zero-reg    zero-reg    Some/None => Dest       X [partly illegible]
1       C           C           Si => Dest              A, X
2       S [?]       E           Sj => Dest              A, X <= Si
3       N           W           Si ∧ Sj => Dest         A, X <= Sj
4       Y           Y           Si ∨ Sj => Dest         Y
5       X           X           Si ⊕ Sj => Dest         X
6       B           B           Si + Sj + Z => Dest     B
7       A           A           ICAP C => Dest          A
8       memory      memory      Si => Z                 memory
9       -           -           memory => MR            -
10      -           -           memory => MR, SB        -
11      -           -           MR => memory            -
12      -           -           MR, SB => memory        -
13-15   -           -           -                       -

Figure 6. CAAPP Cell Instructions


FOR Bit_Num := Field_Length - 1 DOWNTO 0 DO
    Response := Field[Bit_Num]
    IF Some
    THEN
        Activity := Response

Figure 7. Finding a Maximum


There are three reasons for choosing 64 as the number
of CAAPP processors to associate with each ICAP processor.
First, it is convenient to associate one ICAP with each
CAAPP integrated circuit, since this greatly simplifies the
interface between the two levels. Second, with a 512 by 512
array at the bottom and three processing levels plus a single
controller at the top, a uniform inter-level connectivity is
provided by a 64 to 1 reduction factor between each of the
levels. Finally, given the processing needs and increased
power of the processing elements at each level, we roughly
estimate 64 as a viable scale reduction between each pair
of levels.
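The uniform 64 to 1 reduction can be verified with a few lines of Python. The function name and default parameters are illustrative; the numbers themselves (a 512 by 512 base array, three processing levels, one controller) come from the text.

```python
def level_sizes(base_side=512, reduction=64, levels=3):
    """Processor counts from the pixel level up to the single controller."""
    sizes = [base_side * base_side]          # CAAPP: one cell per pixel
    for _ in range(levels):
        sizes.append(sizes[-1] // reduction) # each step up divides by 64
    return sizes

sizes = level_sizes()
```

The result is 262,144 CAAPP cells, 4,096 ICAP processors, 64 SPA processors, and 1 controller, so the same factor applies between every adjacent pair of levels.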

Control  of  the  ICAP  is  provided  by  the  global  controller 
(in  Synchronous-MIMD  mode)  and  by  the  SPA  (in  MIMD 
mode).  Once  an  intermediate  symbolic  representation  has 
been  passed  to  the  ICAP,  and  initial  grouping  operations 
have  taken  place,  each  of  the  SPA  processors  may  then 
query  the  ICAP  in  parallel  to  establish  and  verify  hypothe¬ 
ses.  The  ICAP  provides  four  different  global  OR  responses 
and  a  global  summation  value  as  feedback  to  the  global 
controller. 


Each  of  the  4096  (64  by  64)  ICAP  processors  consists  of 
an  ALU  that  may  perform  16,  8,  or  1  bit  operations,  256k 
bytes  of  local  RAM,  dual  ported  memory  for  interacting 
with  the  CAAPP  and  SPA,  and  neighbor  communications 
hardware.  The  hardware  for  an  ICAP  cell  consists  of  an  off- 
the-shelf  CPU  and  memory,  plus  a  custom  VLSI  circuit  that 
provides  inter-level  communications  and  control  functions. 
For  our  ICAP  processor  we  have  chosen  the  TMS320C25. 
The  320C25  operates  at  5  million  instructions  per  second 
and  can  perform  a  16  bit  multiply  and  accumulate  opera¬ 
tion  in  a  single  instruction  time. 

The  ICAP  communicates  in  the  vertical  dimension  via 
two  sets  of  shared  memory  structures.  There  is  a  shared 
memory  which  acts  as  a  backing  store  for  the  CAAPP  and 
is  available  in  the  address  space  of  the  ICAP.  In  addition, 
the  ICAP  and  the  SPA  share  a  common  memory.  This  is 
viewed as an I/O device for the ICAP and is visible in the
address space of all the processors of the SPA. Communication
between the ICAP and the SPA takes place in much
the same way as communication between the CAAPP and
ICAP, except that they are transferring information that
is represented more abstractly than the information that is
passed between the CAAPP and ICAP. These up and down
connections  provide  the  bi-directional  and  inter-level  com¬ 
munications  necessary  for  image  interpretation.  Through 
the  up/down  links,  information  may  be  passed  upward  as 
its  level  of  abstraction  dictates,  and  reinforcements  of  com¬ 
peting  hypotheses  may  be  passed  downward  as  needed  to 
resolve  conflicts. 

The  side  to  side  links  in  the  ICAP  provide  the  intra- 
level  communications  necessary  for  grouping  and  merging 
processes  to  take  place  on  the  intermediate  symbolic  repre¬ 
sentation.  These  links  are  circuit  switched  and  provide  a  5 
M-bit/second  data  transfer  rate  between  ICAP  processors. 
The  topology  of  this  network  is  a  mesh  with  additional  links 
that  provide  for  both  local  neighborhood  and  long  distance 
communication  across  the  array. 

3.3.3  The  SPA:  High  Level  Processing 

The  high  level  is  implemented  by  the  Symbolic  Pro¬ 
cessing  Array  (SPA),  which  is  an  ensemble  of  64  powerful 
processors.  Each  SPA  processor  is  a  32-bit  computer  with 
at  least  4  Mbytes  of  memory.  In  the  first  prototype  slice  of 
the IUA, the SPA will be an M68020 class processor.

The  SPA  is  designed  to  support  parallel  rule-based  and 
schema  (frame-based)  processing  within  the  framework  of  a 
blackboard architecture [Draper et al., 1986]. In such a system,
various schemas are activated in parallel via focus-of-attention
strategies to evaluate results of initial bottom-up
processing. These schemas generate hypotheses which activate
verification processes in the form of other schemas.
These, in turn, take MSIMD control of the ICAP and
CAAPP  to  perform  top-down  processing  and  re-evaluation. 
The  schema  processes  cooperate  and  compete  via  messages 
posted  on  the  system  blackboard. 

Under  our  current  plan,  communications  between  SPA 
processors  will  take  place  in  part  via  a  virtual  full-connection 
network.  This  will  be  implemented  by  a  high-speed  ring 
structure  that  allows  one  data  word  to  be  passed  from 
each  processor  to  every  other  processor  in  a  single  instruc¬ 
tion  time.  The  ring  I/O  ports  into  each  processor  will 
be  memory  mapped  into  a  dual  ported  content  address¬ 
able  input/output  memory  (CAIOM).  This  structure  was 
previously  described  (Levitan,  1984),  as  a  Full-CAPP  in¬ 
terconnect.  Such  a  structure  is  particularly  well  suited 
for  implementing  a  distributed  blackboard  system.  Un¬ 
like  a  simple  full-connect  network,  the  CAIOM  augmented 
full  connect  network  does  not  overload  the  processors  with 
having to process all of the messages that have been transmitted.
Because the messages are stored directly into an
associative memory upon arrival, the processor may examine
them via parallel associative operations. In many
cases, the Some/None and Count values are all the information
about the messages that is needed. If a particular
message  must  be  examined  directly  then  the  Select  First 
mechanism  permits  the  SPA  processor  to  directly  access 
that  message,  without  first  having  to  search  through  the 
other  messages.  The  CAIOM  is  viewed  as  a  high  prior¬ 
ity  mechanism  for  passing  update  and  control  information 
about the state of the blackboard. The blackboard itself
will  most  likely  reside  in  a  global  shared  memory  that  in¬ 
cludes  a  data  space  which  is  dual-ported  with  the  ICAP 
processors.  This  approach  allows  the  blackboard  to  contain 
a  tremendous  amount  of  information  and  yet  the  knowledge 
source’s  attention  can  be  quickly  refocussed  through  mes¬ 
sage  passing  over  the  CAIOM  network. 
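How a processor might interrogate its CAIOM without reading every message can be sketched in Python. This is a software stand-in only: the class, the message fields, and the keyword-pattern matching are hypothetical, and the sequential scan below replaces what the hardware would do as a parallel associative comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: int    # hypothetical field: originating SPA processor
    topic: str     # hypothetical field: image event or object of interest
    payload: str

class AssociativeInbox:
    """Software model of an associative message store (assumed interface)."""

    def __init__(self):
        self._messages = []

    def store(self, message):
        # Messages are stored on arrival without interrupting the processor.
        self._messages.append(message)

    def _matches(self, **pattern):
        return [m for m in self._messages
                if all(getattr(m, key) == value for key, value in pattern.items())]

    def some(self, **pattern):
        """Some/None: does any stored message match the pattern?"""
        return bool(self._matches(**pattern))

    def count(self, **pattern):
        """Count: how many stored messages match the pattern?"""
        return len(self._matches(**pattern))

    def select_first(self, **pattern):
        """Select First: return one matching message, or None."""
        matches = self._matches(**pattern)
        return matches[0] if matches else None

inbox = AssociativeInbox()
inbox.store(Message(3, "region-17", "hypothesis: roof"))
inbox.store(Message(9, "region-42", "hypothesis: road"))
inbox.store(Message(5, "region-17", "hypothesis: sky"))
first = inbox.select_first(topic="region-17")
```

A schema interested in one image event can ask whether any messages concern it, how many there are, and then retrieve a single one, which mirrors the three kinds of feedback named in the text.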

3.4  Architecture  Summary 

The  three-level  structure  of  the  UMass  Image  Under¬ 
standing  Architecture  supports  the  necessary  hierarchy  of 
abstractions  for  the  different  representations  and  operations 
we  need  to  solve  the  vision  problem.  Each  level  is  con¬ 
structed  to  perform  a  suite  of  tasks  most  appropriate  for 
that  level  of  abstraction. 

The  CAAPP  is  optimized  to  perform  local  operations 
on  neighborhoods  of  pixels  and  to  provide  feedback  to  the 
higher  levels  of  processing  about  the  state  of  the  computa¬ 
tion  and  statistics  about  low  level  data.  It  excels  at  very 
tightly-coupled  fine-grained  parallelism.  The  mapping  of 
one  pixel  onto  each  processor  ensures  that  the  maximum 
amount  of  parallelism  available  in  the  low  level  vision  tasks 
will  be  utilized. 

The  ICAP  is  designed  to  support  the  necessary  tasks  of 
building an intermediate representation of the image and
operating  on  that  representation.  These  operations  need 
two  primary  capabilities,  data  manipulation  and  communi¬ 
cation.  The  data  representations  used  by  the  CAAPP  need 
to  be  transformed  by  the  ICAP  into  a  more  accessible  for¬ 
mat,  and  then  passed  to  neighboring  ICAP  cells  to  perform 
merging  and  grouping  operations. 

The  high  level  tasks  which  perform  schema-based  rea¬ 
soning  run  in  the  SPA.  Schemas  are  processes  consisting 
of  programs  and  data  structures  which  are  used  to  reason 
about  the  image  in  terms  of  a  knowledge  database.  To  sup¬ 
port  this  kind  of  distributed  artificial  intelligence  processing 
we  need  powerful  processors  with  large  amounts  of  mem¬ 
ory.  The  communication  between  processes  will  primarily 
be  in  terms  of  a  distributed  blackboard  system  managed 
by  the  processes  themselves.  As  these  processes  run  and 
make  requests  to  the  ICAP  (and  sometimes  directly  to  the 
CAAPP)  they  will  extract  information  about  the  image  and 
post  the  results  of  their  analysis  on  the  blackboard  for  other 
processes  to  use.  The  end  result  will  be  an  interpretation 
of  the  image. 




3.5  Current  Status 

At present we are building a 1/64th scale demonstration
prototype  of  the  IUA.  This  is  scheduled  for  completion  in 
early  1988  and  will  include  4096  CAAPP  cells,  64  ICAP 
cells,  a  single  SPA  processor  and  an  array  control  unit. 
The  entire  prototype  will  plug  into  a  single-user  workstation 
that  will  serve  as  a  host. 

The  prototype  will  physically  consist  of  a  16  by  20 
inch  mother  board  with  83  daughter  boards.  Eighty  of 
the  daughter  boards  are  4  by  2.5  inches  and  the  remaining 
three  are  20  by  2.5  inches  in  size.  Of  the  80  small  daughter 
boards,  16  are  for  clock  distribution  and  signal  buffering. 
The  other  64  contain  the  ICAP  and  CAAPP  processors  and 
their  memories.  The  three  larger  daughterboards  provide 
the  controller  interface,  feedback  concentration,  and  ICAP 
communications  network  switching.  The  motherboard  also 
includes  a  dual  ported  frame  buffer  memory  that  allows 
high  speed  image  input  and  output. 

Each processor daughterboard will contain a
single custom VLSI chip, a TMS320C25, 256K bytes of static
RAM, 384K bytes of dual-ported dynamic RAM, and tri-state
bus buffers. The single custom chip holds the 64
CAAPP  processors  with  their  local  memories,  the  backing 
store  controller,  a  refresh  controller  for  the  dynamic  RAM, 
and  arbitration  logic  for  the  various  devices  that  must  ac¬ 
cess  the  bus  of  the  associated  ICAP  processor.  Total  power 
dissipation  for  a  processor  daughterboard  is  estimated  at 
approximately  5  watts. 
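The prototype figures quoted above are mutually consistent, as a brief Python check shows. The variable names are illustrative, and the power total covers only the 64 processor daughterboards, since the text gives no estimate for the other boards.

```python
from fractions import Fraction

FULL_SCALE_CAAPP = 512 * 512          # cells in the full-scale CAAPP
CAAPP_PER_BOARD = 64                  # one custom chip per processor board
WATTS_PER_PROCESSOR_BOARD = 5         # estimate quoted in the text
small_boards = {"clock and buffering": 16, "processor": 64}
large_boards = 3

def prototype_summary():
    """Derive board, cell, scale, and power totals from the stated figures."""
    processor_boards = small_boards["processor"]
    caapp_cells = processor_boards * CAAPP_PER_BOARD
    return {
        "daughter_boards": sum(small_boards.values()) + large_boards,
        "caapp_cells": caapp_cells,
        "icap_cells": processor_boards,   # one ICAP processor per board
        "scale": Fraction(caapp_cells, FULL_SCALE_CAAPP),
        "processor_board_watts": processor_boards * WATTS_PER_PROCESSOR_BOARD,
    }

summary = prototype_summary()
```

The totals come to 83 daughter boards, 4096 CAAPP cells, 64 ICAP cells, a scale of 1/64, and roughly 320 watts for the processor boards.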

The  custom  VLSI  chip  is  being  designed  in  2  micron 
CMOS with 2 metal layers. A 32 processor test chip is currently
undergoing fabrication through the MOSIS facility.
A first run of the complete custom chip is scheduled for
summer  of  1987. 

Our  software  simulator  is  being  re-written  to  run  on 
an  Odyssey  signal  coprocessor  in  a  Texas  Instruments  Ex¬ 
plorer.  The  Odyssey  allows  a  direct  emulation  of  the  ICAP 
processor  and  greatly  improves  the  execution  times  for 
CAAPP  simulations  over  our  VAX  based  simulator.  The 
Odyssey simulator will also permit us to closely mimic the
interactions  of  the  three  processing  levels  down  to  the  sig¬ 
nal  level.  The  Odyssey  based  simulator  will  initially  pro¬ 
vide  the  capability  of  a  single  IUA  daughterboard,  and  will 
eventually  be  extended  to  simulate  one  motherboard. 

A  VAX-based  high  level  emulator  is  also  planned  for 
development.  Whereas  the  Odyssey  simulator  is  designed 
to  allow  assembly  language  level  programming,  the  VAX 
emulator  will  be  the  vehicle  of  choice  for  researchers  who 
wish  to  get  an  idea  of  how  the  user-level  IUA  environment 
will  behave.  The  emulator  will  sacrifice  low  level  accuracy 
in  favor  of  greater  speed.  For  example,  the  emulator  will 
be  restricted  to  8,  16  and  32  bit  arithmetic,  avoiding  the 
slow-to-simulate  bit-serial  methods  that  are  actually  used 
in  the  CAAPP. 


Beyond  simply  testing  our  hardware  design,  our  ulti¬ 
mate  goal  for  the  prototype  is  to  provide  a  powerful  in¬ 
terim  development  environment  for  image  understanding 
parallel  processing  research.  A  simulated  parallel  processor 
is  simply  too  slow  to  permit  any  significant  amount  of  ex¬ 
perimentation.  Once  our  prototype  is  up  and  running,  we 
will  be  able  to  accomplish  more  in  the  first  ten  seconds  of 
execution  time  than  we  have  been  able  to  do  in  our  previous 
five  years  of  simulation. 

Because  having  this  much  processing  power  in  a  box 
the  size  of  a  personal  computer  is  so  attractive,  we  have 
designed  our  prototype  to  be  easily  reproducible  for  a  rea¬ 
sonable  cost.  It  has  also  been  designed  to  be  easily  adapted 
to  different  host  systems.  We  thus  hope  that  it  will  be  pos¬ 
sible  to  construct  several  copies  of  the  small  scale  system 
so  that  it  can  be  available  to  a  number  of  researchers  prior 
to  construction  of  the  full  scale  machine. 


4.  THE  OPERATING  ENVIRONMENT 

An  integrated  multiprocessor  system  cannot  be  pro¬ 
grammed  as  a  collection  of  separate  machines.  For  our 
system,  even  the  three  tiers  of  processors  must  be  viewed  as 
a single system. This does not make programming harder;
rather, this should be viewed as a version of the classic model
of a layered system. In operating system design, for instance,
we often design a system as a hierarchy of virtual
machines, each providing a service to the one above by
making demands (calls) to the one below. The IUA follows
this paradigm explicitly: the hardware/software system
at  each  level  provides  a  clean  virtual  machine  for  the  levels 
above  and  below  it. 

The  high  level  system  needs  to  run  a  parallel  LISP 
implementation  of  the  schema  system.  The  currently  ex¬ 
isting  software  environment  for  schema  development  on  a 
single processor is outlined in [Draper et al., 1986]. Besides
the necessary support for a parallel LISP environment,
we need special-purpose hardware (and system support)
for two functions: inter-schema communication and
queries  and  updates  to  the  intermediate  level  representa¬ 
tion.  Schemas  not  only  need  to  communicate  with  other 
schemas  to  resolve  conflicts  and  ambiguity,  they  need  to 
locate  which  other  schemas,  out  of  hundreds  or  thousands, 
share related image events. The CAIOM hardware and
LISP/operating system support will allow schemas to post
requests and associatively search for other schemas with
the  same  objects  of  interest  in  the  image. 

Queries  to  the  intermediate  level  from  the  SPA  are  pro¬ 
cessed  by  making  requests  to  the  ICAP  processors.  These 
requests  can  be  of  two  forms,  local  to  a  specific  sub-image, 
or  global  to  the  whole  image.  The  global  array  control 
unit  receives  these  requests  and  queues  them  for  process¬ 
ing.  Processing  a  request  involves  running  one  (or  several) 
“scripts”.  These  are  microcode  sequences  broadcast  to  the 
CAAPP/ICAP arrays as sets of instructions and data. The
scripts are not restricted to straight-line code: they may contain
loops and branches. Close interaction with the feedback
from the arrays makes it possible to program data-dependent
conditional execution in the scripts.
Data-dependent interactions between the ICAP processors
and their associated CAAPP cells are also possible.
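A data-dependent script of the kind described above can be sketched in Python. The request format (a scope tag plus an optional window), the script itself, and the flag list standing in for array state are all assumptions made for illustration; the text does not specify how requests or scripts are encoded.

```python
from collections import deque

def enumerate_responders_script(flags):
    """Visit responders one at a time; the loop length depends on the data."""
    flags = list(flags)
    visited = []
    while any(flags):               # Some/None feedback controls the loop
        first = flags.index(True)   # Select-First isolates one responder
        visited.append(first)
        flags[first] = False        # deactivate it and test again
    return visited

def process_queue(queue, flags):
    """Run queued requests in order; each is global or local to a sub-image."""
    results = []
    while queue:
        scope, window = queue.popleft()
        if scope == "global":
            cells, offset = flags, 0
        else:
            cells, offset = flags[window], window.start
        results.append([offset + i for i in enumerate_responders_script(cells)])
    return results

flags = [False, True, False, True, True, False]
queue = deque([("global", None), ("local", slice(2, 5))])
results = process_queue(queue, flags)
```

The number of loop iterations is not known when the script is written; it is determined at run time by the feedback from the array, which is the point the paragraph makes about conditional execution.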

5.  THE  DEVELOPMENT  ENVIRONMENT 

During  the  hardware  development  phase,  software  func¬ 
tional  level  simulators  are  being  used  for  testing  implemen¬ 
tations  of  vision  algorithms  on  the  IUA.  CAAPP  simulators 
have  been  in  use  to  refine  both  the  architecture  and  low 
level  algorithms.  New  CAAPP  and  ICAP  simulators  will 
be  available  both  separately,  for  testing  algorithms  that  run 
on  a  single  level,  and  in  an  integrated  form.  The  SPA  simu¬ 
lator  will  be  a  multiprocessing  CommonLisp  environment, 
running  on  a  commercial  multiprocessor,  and  augmented 
with functions that simulate the CAIOM communications
mechanism. 

The  simulators  may  be  adjusted  to  simulate  a  CAAPP 
with  a  width  of  any  power  of  two  from  8  through  512  (with 
the  ICAP  simulator  adjusted  correspondingly  to  a  width 
of  1  through  64).  Most  software  testing  will  be  done  with 
the  simulators  set  to  a  small  array  width  to  facilitate  turn¬ 
around  time  on  simulations.  When  the  simulators  are  set  to 
full  width,  our  experience  has  been  that  we  achieve  less  than 
one  second  of  simulated  IUA  time  per  year  of  VAX  CPU 
time.  Of  course,  once  the  hardware  becomes  available,  the 
development  environment  will  shift  entirely  onto  it,  in  order 
to  speed  up  software  testing. 
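The permitted simulator configurations and the scale of the full-width slowdown can be worked out in a short Python sketch. The function name is illustrative, and the slowdown figure is only a lower bound implied by the statement that a year of VAX CPU time yields less than one simulated second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds in a 365-day year
TILE = 8                             # CAAPP cells per ICAP processor along one side

def simulator_widths():
    """Pair each allowed CAAPP width with the matching ICAP width."""
    return [(width, width // TILE) for width in (2 ** k for k in range(3, 10))]

widths = simulator_widths()
# Less than one simulated second per CPU year means the full-width
# simulator runs at least this many times slower than the real IUA.
minimum_slowdown = SECONDS_PER_YEAR
```

The seven configurations run from an 8-wide CAAPP with a single ICAP processor up to the full 512-wide CAAPP with a 64-wide ICAP, and at full width the simulator is more than thirty million times slower than the hardware it models.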

In  addition  to  the  simulators,  assemblers  are  planned 
for  the  CAAPP  and  ICAP.  These  are  assemblers  only  in 
that  they  allow  the  programmer  to  program  the  CAAPP 
and  ICAP  at  the  instruction-by-instruction  level.  The  as¬ 
semblers  will  be  interactive,  providing  information  to  the 
programmer  that  will  aid  in  optimizing  utilization  of  the 
two  levels  in  parallel.  Programs  that  are  constructed  through 
the  assemblers  will  initially  execute  on  the  simulators,  and 
later  on  the  actual  hardware.  Each  program  generated  by 
the  assemblers  also  takes  the  form  of  a  LISP  or  C  callable 
subroutine. This permits the SPA simulator and the UMass
VISIONS  system  to  interface  directly  with  any  software 
that  is  developed  for  the  IUA. 

6.  FURTHER  DEVELOPMENT 

We  are  currently  working  with  researchers  at  Hughes 
Research  Laboratories  on  refinements  to  the  ICAP  struc¬ 
ture  and  the  CAAPP  to  ICAP  control  interface.  One  pos¬ 
sibility  being  explored  is  to  give  the  ICAP  the  ability  of 
interacting  with  a  local  CAAPP  controller.  The  CAAPP 
would  then  be  able  to  execute  in  Multi-SIMD  mode  at  the 
chip level: each 8x8 section could then execute a separate
instruction stream. Another extension would be to
augment  the  ICAP  globally  controlled  circuit  switched  net¬ 
work  with  a  distributed  control  routing  network. 


7.  CONCLUSION 

As we discussed in section 1, we believe that to build
a  computer  vision  system  we  must  provide  an  architecture 
which  can  help  address  the  problem  areas  of:  the  unreliabil¬ 
ity  of  segmentation  processes,  the  compounding  of  uncer¬ 
tainty  from  different  processing  techniques,  the  need  to  pro¬ 
vide  mechanisms  for  information  fusion  as  well  as  support 
for  multiple  representations  of  data  and  meta-knowledge, 
the  need  for  global  context  dependent  local  processing,  and 
support  for  inferencing  techniques. 

The  Image  Understanding  Architecture  consists  of  an 
integrated  vision  system  executing  on  a  parallel  processing 
ensemble.  It  supports  many  processes  performing  many 
tasks  at  three  levels  of  abstraction.  Each  level  is  embodied 
in a particular set of processing elements suited to the requirements
at that level of processing. High speed low level
local  processing  elements  provide  fast  segmentation  pro¬ 
cesses  which  can  be  tuned  and  retried.  Associative  process¬ 
ing  allows  feedback  from  the  processes  to  the  control  system 
as  well  as  supporting  context  dependent  operations.  Infor¬ 
mation  fusion  is  provided  by  intermediate  level  processes 
running  at  the  appropriate  level  of  abstraction  to  reduce 
uncertainty  from  different  processing  strategies.  Symbolic 
processing  at  the  high  level  supports  a  schema  based  infer¬ 
ence  system  where  knowledge  about  the  real  world  is  repre¬ 
sented  as  both  data  and  control  information.  Communica¬ 
tion  and  control  both  between  levels  (vertically)  and  among 
processors  at  each  level  (horizontally)  allows  the  system  to 
work  in  a  concerted  manner  to  perform  machine  vision. 

8.  ACKNOWLEDGEMENTS 

We would like to thank our colleagues at Hughes Research
Laboratories in Malibu, Dr. J. Gregory Nash and
Dr. David B. Shu, for their many contributions to the design
and  implementation  of  the  Image  Understanding  Architec¬ 
ture. 

This  research  was  supported,  in  part,  by  the  Advanced 
Research  Projects  Agency  of  the  Department  of  Defense 
and  was  monitored  by  the  Air  Force  Office  of  Scientific 
Research  under  Contract  No.  F49620-86-C-0041  and  by 
the  Engineer  Topographic  Laboratory  under  Contract  No. 
DACA76-86-C0015.  We  would  also  like  to  thank  Caxton 
C.  Foster  for  his  ideas,  support  and  guidance. 




9.  REFERENCES 


Akers,  S.  B.  and  Krishnamurthy,  B.,  A  Group  Theoretical 
Model  for  Symmetric  Interconnection  Networks,  Proc.  1986 
International Conference on Parallel Processing, K. Hwang,
S. M. Jacobs and E. E. Swartzlander (Eds.), St. Charles,
Ill., August 19-22, 1986, pp. 216-223.

Batcher,  K.  E.,  Design  of  a  Massively  Parallel  Processor, 
IEEE Trans. Comp., 29:1-9, September 1980.

Draper,  B.,  Hanson,  A.,  and  Riseman,  E.,  A  Software  En¬ 
vironment  for  High  Level  Vision,  COINS  Technical  Report, 
University  of  Massachusetts  at  Amherst,  (in  preparation) 
1986. 

Erman,  L.,  et  al.,  The  Hearsay-II  Speech-Understanding 
System: Integrating Knowledge to Resolve Uncertainty, Computing
Surveys, 12:213-253, 1980.

Faugeras,  O.,  and  Price,  K.,  Semantic  Descriptions  of  Aerial 
Images  Using  Stochastic  Labeling,  IEEE  Trans.  Pattern 
Analysis  and  Machine  Intelligence,  3:638-642,  1981. 

Foster, C. C., Content Addressable Parallel Processors, New
York:  Van  Nostrand  Reinhold,  1976.  (a) 

Foster,  C.  C.,  Computer  Architecture,  New  York:  Van  Nos¬ 
trand  Reinhold,  1976.  (b) 

Hanson,  A.  R.  and  Riseman,  E.  M.,  Preprocessing  Cones:  A 
Computation  Structure  for  Scene  Analysis,  COINS  Techni¬ 
cal  Report  74C-7,  University  of  Massachusetts  at  Amherst, 
September  1974. 

Hanson,  A.  R.  and  Riseman,  E.  M.  (Eds.),  Computer  Vision 
Systems, New York: Academic Press, 1978. (a)

Hanson,  A.  R.,  Riseman,  E.  M.,  VISIONS:  A  Computer 
System  for  Interpreting  Scenes.  In:  Computer  Vision  Sys¬ 
tems,  A.  R.  Hanson,  and  E.  M.  Riseman  (Eds.),  New  York: 
Academic Press, 1978, pp. 303-333. (b)

Hanson, A. R., Riseman, E. M., Segmentation of Natural
Scenes. In: Computer Vision Systems, A. R. Hanson, and
E. M. Riseman (Eds.), New York: Academic Press, 1978,
pp. 129-163. (c)

Hanson, A. R., Riseman, E. M., Processing Cones: A Computational
Structure for Scene Analysis for Image Analysis.
In: Structured Computer Vision, S. Tanimoto and A.
Klinger  (Eds.),  New  York:  Academic  Press,  1980. 

Hanson,  A.  R.,  Riseman,  E.  M.,  A  Methodology  for  the 
Development  of  General  Knowledge-Based  Vision  Systems, 
(to  appear).  In:  Vision,  Brain,  and  Cooperative  Compu¬ 
tation,  M.  Arbib  and  A.  Hanson  (Eds.),  Cambridge,  MA: 
MIT  Press,  1986. 


Levitan,  S.  P.,  Parallel  Algorithms  and  Architectures:  A 
Programmers  Perspective,  Ph.D.  Dissertation,  Computer 
and  Information  Science  Department,  also,  COINS  Techni¬ 
cal  Report  84-11,  University  of  Massachusetts  at  Amherst, 
May  1984. 

Levitan,  S.  P.,  Measuring  Communication  Structures  in 
Parallel  Architectures  and  Algorithms,  (to  appear).  In:  The 
Characteristics  of  Parallel  Algorithms,  L.  Jamieson,  D.  Gan¬ 
non,  and  R.  Douglass  (Eds.),  Cambridge,  MA:  MIT  Press, 
1987. 

Parma,  C.  C.,  Hanson  A.  R.,  and  Riseman,  E.  M.,  Ex¬ 
periments  in  Schema-Driven  Interpretation  of  a  Natural 
Scene,  COINS  Technical  Report  80-10,  University  of  Mas¬ 
sachusetts  at  Amherst,  April  1980.  Also  in:  NATO  Ad¬ 
vanced  Study  Institute  on  Digital  Image  Processing,  R.  Har- 
alick  and  J.  C.  Simon  (Eds.),  Bonas,  France,  1980. 

Reynolds,  G.,  Irwin,  N.,  Hanson  A.  R.,  and  Riseman,  E.  M., 
Hierarchical  Knowledge-Directed  Object  Extraction  Using 
a  Combined  Region  and  Line  Representation,  Proc.  of  the 
Workshop  on  Computer  Vision:  Representation  and  Con¬ 
trol,  Annapolis,  Maryland,  April  30  -  May  2,  1984,  pp.  238- 
247. 

Tanimoto,  S.,  A  Pyramidal  Approach  to  Parallel  Process¬ 
ing,  Proc.  10th  Annual  International  Symp.  on  Computer 
Architecture,  Stockholm,  Sweden,  June,  1983. 

Uhr, L., Layered Recognition Cone Networks that Preprocess,
Classify and Describe, IEEE Trans. Comp., 21:758-768,
1972.

Weems, C. C., Image Processing on a Content Addressable
Array Parallel Processor, Ph.D. Dissertation, Computer
and Information Science Department, also, COINS Technical
Report 84-14, University of Massachusetts at Amherst,
September 1984. (a)

Weems,  C.  C.,  Levitan,  S.  P.,  Foster,  C.  C.,  Riseman,  E. 
M.,  Lawton,  D.  T.,  and  Hanson,  A.  R.,  Development  and 
Construction  of  a  Content  Addressable  Array  Parallel  Pro¬ 
cessor  (CAAPP)  for  Knowledge-Based  Image  Interpreta¬ 
tion,  Proc.  Workshop  on  Algorithm-Guided  Parallel  Archi¬ 
tectures  for  Automatic  Target  Recognition,  Leesburg,  VA, 
July 16-18, 1984, pp. 329-359. (b)

Weems,  C.  C.,  The  Content  Addressable  Array  Parallel 
Processor: Architectural Evaluation and Enhancement, Proc.
IEEE International Conference on Computer Design: VLSI
in Computers, Port Chester, New York, October 7-10, 1985,
pp. 500-503.

Weymouth,  T.  E.,  Using  Object  Descriptions  in  a  Schema 
Network  for  Machine  Vision,  Ph.D.  Dissertation,  Computer 
and  Information  Science  Department,  also,  COINS  Techni¬ 
cal  Report  86-24,  University  of  Massachusetts  at  Amherst, 
1986. 




CONSTRUCTS  FOR  COOPERATIVE 
IMAGE UNDERSTANDING ENVIRONMENTS


Christopher  C.  McConnell,  Philip  C.  Nelson,  Daryl  T.  Lawton 


Advanced  Decision  Systems,  Mountain  View,  California  94040 


ABSTRACT 

Researchers  in  machine  vision  require  a  rich  set  of 
tools  for  developing  theories,  algorithms,  and  systems. 
These  range  from  supporting  the  expression  of  numeri¬ 
cally  intensive  computations  to  rule-based,  symbolic 
inference  processes.  This  paper  describes  a  tightly  cou¬ 
pled  set  of  such  constructs  that  have  been  developed 
for  a  software  environment  for  image  understanding. 
The major components of this are: 1) An object-oriented
programming mechanism for defining, creating,
modifying, and combining objects. There are also
databases for the storage and access of long term
information such as procedures, object definitions, and
instances of processing environments; 2) An underlying
functional  form  built  from  an  extendable  set  of  macros 
that  is  used  for  expressing  common  processing  opera¬ 
tions  for  several  different  types  of  objects;  3)  A  set  of 
uniform  databases  for  automatically  maintaining 
resources  and  results  in  an  interactive  or  autonomous 
processing  environment;  4)  An  interactive  display  facil¬ 
ity  for  images  that  generalizes  to  a  wide  range  of  other 
objects,  including  curves,  regions,  and  groups. 


1.  INTRODUCTION 


Researchers in computer vision deal with problems
at several different levels of computation, from
numerically intensive computations that are often
accomplished in parallel over large, structured arrays,
to processing involving symbolic and relational objects,
such as extracted image structures, instantiated model
hypotheses, and interpretation rules. This work
requires a highly interactive environment in which to
prototype, experiment, and interpret results. In addition,
work in computer vision usually involves
research teams, because of the range of problems
being addressed. It is essential, especially in building
application systems of any complexity, that tools support
the integration and capabilities of several people's
work. The set of tools and the programming environment
dramatically shape the culture for doing vision
research [Donaldson et al. - 83, Kohler et al. - 83,
Quam - 84].

Figure 1-1 shows the basic components that have
been developed as part of such an environment. The
object/function definition facility supplies a powerful
and uniform set of constructs for creating object types
and instances at several of the different levels necessary
for vision research. These include image registered
features, intermediate objects such as regions, and high
level objects such as visual models. The display facility
provides a uniform mechanism for viewing objects.

There are four databases for automatically keeping
track of created object instances (Perceptual Structure
Database or PSDB), processing history (Environmental
Database or EDB), existing functions and object
type definitions (Programmers Database or PDB), and
stored results such as images, objects, and particular
instances of the PSDB, EDB, and LTDB. These databases
are accessed through a uniform query language
or, optionally, through a set of interactive browsers.
The database mechanisms allow research techniques
and results to be shared by a distributed community of
users.


Figure 1-1: Components




2.  OBJECTS 


It  is  necessary  to  represent  many  different  types 
of  objects  in  computer  vision.  To  function  within  an 
integrated  environment,  the  method  of  defining,  saving, 
accessing,  displaying,  and  combining  these  objects  must 
be  specified.  In  addition  to  these  common  features,  an 
object  must  also  represent  its  class  specific  properties 
(e.g.,  all  points  have  a  location),  and  perhaps  some 
instance  specific  properties  (e.g.,  this  curve  has  this 
interesting  feature).  The  ability  to  assign  both  class 
and  instance  attributes  to  objects  allows  a  researcher  to 
standardize  well  understood  properties,  but  still  permits 
a  rich,  dynamic  representation  mechanism. 

To  meet  these  and  other  requirements,  objects  are 
implemented  in  an  object-oriented  programming 
methodology.  In  such  a  system,  an  object  and  the  pro¬ 
cedural  forms  used  to  manipulate  it  are  symbolically 
attached.  An  object  is  then  manipulated  by  instructing 
it  to  evaluate  the  symbolically  specified  operation  - 
known as sending the object a message. This is best
illustrated by an example. In standard programming,
a display routine that requires the location
of an object (e.g., a point) would access the value with a
function available in the environment, (location
<point>). In an object-oriented system, the point's
location would be queried as (send <point> :location).

When  only  one  type  of  object  is  present  in  the 
environment,  these  two  approaches  are  merely  a  syn¬ 
tactic  transformation.  A  problem  arises,  however,  with 
the  introduction  of  new  types  of  objects.  The  location 
of  a  curve,  for  example,  may  be  defined  as  the 
geometric  center  of  its  bounding  box  as  opposed  to  the 
row  and  column  associated  with  a  particular  point.  To 
enable  the  display  routine  to  manipulate  curve  objects 
as  well  as  points,  the  location  function  must  be 
modified  to  act  differently  depending  on  the  type  of  its 
input. The need to constantly modify the manipulators
to support new objects quickly yields a bulky,
inflexible, unorganized system.

In  the  object-oriented  example,  the  curve  object 
symbolically  attaches  the  bounding  box  calculation  to 
the  location  message,  completely  independent  of  the 
characteristics  of  a  point.  The  task  of  specifying  how 
an  object  is  to  be  manipulated  is  distributed  out  of  the 
overloaded  location  operator  and  is  transferred  into  the 
object.  The  object  now  determines  the  implementation 
of the semantic property “location”.

This  distribution  of  functionality  gives  a  great 
deal  of  power.  From  the  caller's  perspective,  all  objects 
are  seen  to  support  similar  properties,  while  objects 
need  only  be  concerned  with  their  own  manipulators. 
Multiple  representations  of  the  same  object  type  can  be 
supported  as  well  as  multiple  types.  Because  instances 
in  our  environment  can  contain  their  own  specific  pro¬ 
perties,  a  particular  object  can  interpret  a  message 
differently  than  all  others  in  its  class.  This  capability  is 
the  foundation  of  virtual  objects,  described  later  in  this 
section. 
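
The contrast between the two styles can be sketched in a few lines of Python. This is an illustration only, not the Lisp of the environment; the class names Point and Curve and the method name location are assumptions chosen to mirror the example in the text.

```python
# Illustrative sketch: Point, Curve, and location() are hypothetical names.

class Point:
    def __init__(self, row, col):
        self.row, self.col = row, col

    def location(self):
        # A point's location is simply its row and column.
        return (self.row, self.col)


class Curve:
    def __init__(self, points):
        self.points = list(points)  # list of (row, col) pairs

    def location(self):
        # A curve's location is the geometric center of its bounding box.
        rows = [r for r, _ in self.points]
        cols = [c for _, c in self.points]
        return ((min(rows) + max(rows)) / 2, (min(cols) + max(cols)) / 2)


def describe(obj):
    # The caller sends the same "location" message to any object and
    # never inspects the object's type.
    return obj.location()


point_location = describe(Point(3, 4))
curve_location = describe(Curve([(0, 0), (2, 6), (4, 2)]))

# An instance-specific property: this one point answers the message
# differently from every other member of its class.
special = Point(1, 1)
special.location = lambda: (0, 0)
special_location = describe(special)
```

The caller, describe, never changes when a new object type is introduced; only the new class supplies its own answer to the message.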


2.1  DEFINED  PERCEPTUAL  OBJECTS 

Several  types  of  objects  in  the  initial  environment 
have  been  defined,  including  images,  points,  curves,  and 
regions.  These  objects  support  the  basic  properties 
common to all objects: save-to-file, copy, and additional
properties particular to each class. There are
also  composite  objects,  stacks,  and  groups  that 
abstractly  combine  collections  of  other  objects. 

Interesting  combinations  of  the  initial  objects  are 
immediately  apparent.  An  image  can  be  created  where 
each  pixel  is  a  pointer  to  a  list  of  extracted  objects  that 
exist  at  that  location.  This  is  called  a  label  plane  and 
is  used  extensively  for  geometric  reasoning.  A  region 
can  be  bounded  by  a  list  of  curves,  and  a  group  can 
enclose  similar  regions.  Any  desired  combination  is 
possible. 
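
A label plane can be pictured with a small Python sketch. The function name make_label_plane and the dictionary of named curves are hypothetical; the sketch only assumes what the text states, namely that each pixel refers to the list of extracted objects present at that location.

```python
# Hypothetical sketch of a label plane: each cell lists the objects
# that exist at that pixel.

def make_label_plane(height, width, objects):
    """objects maps an object name to the (row, col) pixels it covers."""
    plane = [[[] for _ in range(width)] for _ in range(height)]
    for name, pixels in objects.items():
        for row, col in pixels:
            plane[row][col].append(name)
    return plane


curves = {
    "curve-1": [(0, 0), (1, 1), (2, 2)],
    "curve-2": [(0, 2), (1, 1), (2, 0)],
}
plane = make_label_plane(3, 3, curves)

# Geometric query: which objects pass through the center pixel?
objects_at_center = plane[1][1]
```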

An  important  feature  of  all  non-image  objects 
such  as  points,  junctions,  curves,  and  regions,  is  the 
representation  of  their  spatial  characteristics.  Any 
representation  should  be  compact,  provide  fast  access, 
and  should  facilitate  most  common  operations.  How¬ 
ever,  there  is  no  optimal  format;  each  must  trade  off 
between  time  and  space  considerations. 

The  initially  defined  objects  allow  the  researcher 
to  choose  among  several  possible  pre-defined  represen¬ 
tations:  arrays,  segments,  chain  codes,  and  quad  trees. 
These  variations  are  possible  with  our  utilization  of 
object-oriented  programming.  When  a  representation 
is  chosen,  the  object  inherits  the  procedures  associated 
with  the  relevant  messages,  and  the  newly  defined 
object  can  immediately  be  manipulated  in  the  environ¬ 
ment. 


Figure  2-1:  Interactions  With  Label  Plane 




As  an  example  of  label  planes  and  system  defined 
object  types,  Figure  2-1  shows  a  display  function  called 
display-values  for  interactively  accessing  and  applying 
functions  to  values  in  images.  In  this  figure,  the  image 
is  a  label  plane  for  the  set  of  extracted  curves  and  the 
result  returned  by  the  function  is  the  description  of  the 
curve instance that is selected. The default curve attributes
are shown in small letters and the associated
attributes are shown in capitals.

2.2  OBJECT  ACCESS 

As discussed above, all objects in the environment
are accessible through a uniform set of messages
provided by the object-oriented programming methodology.
This  layer  of  procedural  abstraction  allows  powerful 
generalizations.  A  small  set  of  display  routines  can  be 
applied  to  a  large  number  of  varied  objects.  The  mes¬ 
sages  can  mask  any  complexity  in  describing  an 
object's  physical  characteristics.  Virtual  objects  can  be 
created  that  have  no  internal  data  but  are  procedural 
filters  over  other  objects  already  defined. 

The uniform interface requires each object to
understand messages in both a 1D and a 2D context.
This allows the system to make no distinction between
objects that are considered inherently “linear” or
inherently “planar”. There are three basic messages in
the uniform object interface, each with a one and a two
dimensional version:


1. BOUNDARIES -

•  1D) The number of data points in the
object.

•  2D) The minimum and maximum X
and Y value in the object.

2. POSITION -

•  1D) Maps a 1D offset into the
appropriate X and Y value.

•  2D) Maps an X,Y position into the
equivalent 1D offset.

3. VALUE -

•  1D) Returns the color at a particular
offset.

•  2D) Returns the color at a particular
X and Y position.


To put these messages into perspective, consider a
two dimensional image object. The meaning of the 2D
messages is clear. The BOUNDARIES message returns
the dimensions of the array, the POSITION message
returns the offset of a specified row and column from
the base of the array, and the VALUE message references
the pixel value at a particular row and column.
The 1D messages can be interpreted as follows: the
BOUNDARIES message returns the number of pixels in
the image (i.e., Width * Height), the POSITION message
does the inverse mapping from the array offset to
the appropriate row and column, and the VALUE message
does a lookup treating the array as a very long 1D
vector, similar to the way it is stored in memory.

The meanings of the 1D and 2D messages are not
as closely correlated in a curve object represented as a list of
(X, Y, COLOR) triplets. The 1D BOUNDARIES message
returns the total number of points in the curve,
the 1D POSITION message returns the X and Y position
of the nth point, and the 1D VALUE message
returns the color associated with the nth point. The
2D messages can be interpreted as follows: the BOUNDARIES
message returns the geometric bounding box
(i.e., minimum and maximum X and Y values) of the
curve, the POSITION message returns the point
number of the point at, or closest to, a particular X
and Y, and the VALUE message calls the 2D POSITION
message to find the appropriate point and then
calls the 1D VALUE message to look up its color. This
is an example of combining messages to perform a complex
action.
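
The image and curve interpretations described above can be sketched as two Python classes that answer the same six messages. This is a sketch under assumptions: the class and method names are hypothetical, the image is stored as a list of rows, and the 2D boundaries are returned as (min X, max X, min Y, max Y).

```python
class ImageObject:
    """A 2D array answering the uniform 1D and 2D messages."""

    def __init__(self, rows):
        self.rows = rows
        self.height = len(rows)
        self.width = len(rows[0])

    def boundaries_1d(self):
        return self.width * self.height

    def boundaries_2d(self):
        return (0, self.width - 1, 0, self.height - 1)

    def position_1d(self, offset):
        # Inverse mapping: array offset -> (x, y).
        return (offset % self.width, offset // self.width)

    def position_2d(self, x, y):
        # Offset of a column x and row y from the base of the array.
        return y * self.width + x

    def value_1d(self, offset):
        x, y = self.position_1d(offset)
        return self.rows[y][x]

    def value_2d(self, x, y):
        return self.rows[y][x]


class CurveObject:
    """A curve stored as a list of (x, y, color) triplets."""

    def __init__(self, triplets):
        self.triplets = list(triplets)

    def boundaries_1d(self):
        return len(self.triplets)

    def boundaries_2d(self):
        xs = [x for x, _, _ in self.triplets]
        ys = [y for _, y, _ in self.triplets]
        return (min(xs), max(xs), min(ys), max(ys))

    def position_1d(self, n):
        x, y, _ = self.triplets[n]
        return (x, y)

    def position_2d(self, x, y):
        # Point number of the point at, or closest to, (x, y).
        def squared_distance(n):
            px, py, _ = self.triplets[n]
            return (px - x) ** 2 + (py - y) ** 2
        return min(range(len(self.triplets)), key=squared_distance)

    def value_1d(self, n):
        return self.triplets[n][2]

    def value_2d(self, x, y):
        # Combine two messages: find the point, then look up its color.
        return self.value_1d(self.position_2d(x, y))


image = ImageObject([[10, 20, 30], [40, 50, 60]])
curve = CurveObject([(0, 0, 7), (5, 5, 8), (9, 1, 9)])
```

A display routine written against these six methods works on either object without knowing which representation lies underneath.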

Note that this is not the only interpretation for a
curve object. For example, the 2D POSITION message
may only interpolate to the closest point when the
specified X and Y lie along the curve boundary. The
2D VALUE message may interpolate the colors of the
two adjacent points rather than returning the color of
the closest point. This is a major advantage of the
functional interface. An object decides how it is to be
described; its description is NOT determined by any
environmental routines.

This system also supports multiple internal
representations. If a curve is associated with a label
plane, the curve object can reference the label plane to
answer the 2D messages and the (X, Y, COLOR) list to
answer the 1D messages. Again, this capability is
independent of the interface system; it is wholly
contained inside the object.

It should be noted that an object need not understand
every message. The drawbacks of such an
incomplete object are that it may not function properly
in some display routines, and it may not combine with
other objects or other procedural filters to create
compound or virtual objects.

2.3  VIRTUAL  OBJECTS 

A virtual object looks and acts like any other
object in the environment; however, the procedures
underlying its access messages modify the descriptions
of other objects without duplicating their underlying
representation. In other words, a virtual object is just
a filter around the messages of other objects. A simple
example is a virtual image (V-IMAGE) which is the
inverse of an existing image (IMAGE). V-IMAGE uses
the BOUNDARIES and POSITION messages from
IMAGE unchanged, but modifies the 1D-VALUE message
as follows:

'(lambda (index)
   (logxor -1 (send IMAGE :1D-VALUE index)))




There is a speed disadvantage in this virtual object
because the XOR calculation must be done each time a
pixel is accessed. However, there is tremendous space
efficiency because V-IMAGE, a unique, new image, is
created at the cost of a few lambda expressions.
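
The inverse image can be sketched in Python as a wrapper that stores no pixels of its own. The class names are hypothetical, and masking the result to 8 bits is an assumption added here so that the inverted values stay in the range 0 to 255; the Lisp expression in the text applies logxor with -1 directly.

```python
class Image:
    def __init__(self, pixels):
        self.pixels = pixels  # flat list of 8-bit values

    def boundaries_1d(self):
        return len(self.pixels)

    def value_1d(self, index):
        return self.pixels[index]


class VirtualImage:
    """Holds no pixel data; filters the messages of a source object."""

    def __init__(self, source, value_fn):
        self.source = source
        self.value_fn = value_fn

    def boundaries_1d(self):
        # Passed through to the source unchanged.
        return self.source.boundaries_1d()

    def value_1d(self, index):
        # Recomputed on every access: slower, but costs no pixel storage.
        return self.value_fn(self.source.value_1d(index))


def invert(p):
    # XOR with -1 flips every bit; the 8-bit mask is an assumption.
    return (p ^ -1) & 0xFF


image = Image([0, 1, 255])
v_image = VirtualImage(image, invert)
inverted = [v_image.value_1d(i) for i in range(v_image.boundaries_1d())]

# A virtual object can itself be the source of another virtual object.
restored = VirtualImage(v_image, invert)
round_trip = [restored.value_1d(i) for i in range(restored.boundaries_1d())]
```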

Virtual  objects  are  especially  important  in  a 
display  system  where  many  minor  modifications  of 
existing  objects  are  desired  to  tune  the  appearance  of 
the  display.  Rather  than  having  a  fixed  set  of  these 
tuning  options  in  the  display  routines,  such  as  scaling, 
inverting,  adding,  etc.,  any  combination  that  can  be 
expressed  as  a  lambda  expression  can  be  specified. 
Binary  masks  can  be  used  to  conditionally  modify  the 
value of a full 8-bit image. A label plane of pointers to
objects, which is not a directly displayable image, can be
run through a filter that returns an integer based on
the attributes of the object pointed at in the label
plane at any given position. Because a virtual object is
itself  a  complete  object,  answering  all  the  messages 
specified  in  the  uniform  interface,  it  can  be  used  recur¬ 
sively  as  a  data  source  in  a  new  virtual  object. 

Virtual objects also provide an excellent interface
to various types of display hardware. An 8-bit image
can be displayed on a monochrome display by mapping
some values to one and the rest to zero. Three 8-bit
images can be combined into a 24-bit virtual image by
adding together the first image value, the second image
value shifted by 8 bits and the third image value
shifted by 16 bits.
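
Both mappings amount to one line of arithmetic each. In the Python sketch below the function names are hypothetical, and the threshold used for the monochrome mapping is an assumption; the text only says that some values map to one and the rest to zero.

```python
def pack_24bit(first, second, third):
    """Combine three 8-bit values into one 24-bit value."""
    return first + (second << 8) + (third << 16)


def to_monochrome(value, threshold=128):
    """Map an 8-bit value to 1 or 0 (the threshold rule is an assumption)."""
    return 1 if value >= threshold else 0


first_image = [0x12, 0xFF]
second_image = [0x34, 0x00]
third_image = [0x56, 0x80]

packed = [pack_24bit(a, b, c)
          for a, b, c in zip(first_image, second_image, third_image)]
bits = [to_monochrome(v) for v in first_image]
```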

Virtual objects can also map over localized neighborhoods
in the source object. Gradients, convolution,
and median filtering are all possible without ever
creating a new image. An example is a virtual image
that returns the maximum value in a 2 by 2 neighborhood
in a source image. The 2D-VALUE message of
such a virtual image is:


'(lambda (row col)
   (max (send IMAGE :2D-VALUE row col)
        (send IMAGE :2D-VALUE (1+ row) col)
        (send IMAGE :2D-VALUE row (1+ col))
        (send IMAGE :2D-VALUE (1+ row) (1+ col))))

Referencing a pixel in this virtual image is at least four
times as expensive as referencing a pixel in the source
image but again, there is a tremendous space savings.
Note that the virtual image described above is actually
one pixel smaller than the source image (i.e., the source
image must always have at least one row and one
column beyond any call to 2D-VALUE). This is accomplished
by filtering the BOUNDARIES message of the
original image through a function that reduces the
dimensions by one.
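
A Python sketch of this neighborhood maximum follows. It assumes the neighborhood extends one row and one column beyond (row, col), as the remark about the source image needing an extra row and column suggests; the class names and the (rows, columns) form of the boundaries are hypothetical.

```python
class GridImage:
    def __init__(self, rows):
        self.rows = rows

    def boundaries_2d(self):
        return (len(self.rows), len(self.rows[0]))  # (rows, columns)

    def value_2d(self, row, col):
        return self.rows[row][col]


class Max2x2:
    """Virtual image: maximum over a 2 by 2 neighborhood of the source."""

    def __init__(self, source):
        self.source = source

    def boundaries_2d(self):
        # One row and one column smaller, so every neighborhood fits.
        n_rows, n_cols = self.source.boundaries_2d()
        return (n_rows - 1, n_cols - 1)

    def value_2d(self, row, col):
        s = self.source
        return max(s.value_2d(row, col),
                   s.value_2d(row + 1, col),
                   s.value_2d(row, col + 1),
                   s.value_2d(row + 1, col + 1))


source = GridImage([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 0]])
maxima = Max2x2(source)
n_rows, n_cols = maxima.boundaries_2d()
values = [[maxima.value_2d(r, c) for c in range(n_cols)]
          for r in range(n_rows)]
```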

As demonstrated in the above example, virtual
objects are not restricted to modifying only the
VALUE message of their source objects. By modifying
the POSITION and BOUNDARIES messages, other
powerful effects are possible. For example, to reduce an
image by a factor of two by averaging local neighborhoods,
the BOUNDARIES message divides the original
bounds by 2, the POSITION message maps from
reduced pixel locations to their original position (by
multiplying the row and column values by 2), and the
VALUE message does an average across the neighborhood
in the original image to produce the reduced
value.
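
The three modified messages of the reduction can be sketched in Python as follows. The class names are hypothetical, and the sketch assumes a 2 by 2 averaging neighborhood and source dimensions that are even.

```python
class GridImage:
    def __init__(self, rows):
        self.rows = rows

    def boundaries_2d(self):
        return (len(self.rows), len(self.rows[0]))  # (rows, columns)

    def value_2d(self, row, col):
        return self.rows[row][col]


class HalfSize:
    """Virtual image: the source reduced by a factor of two."""

    def __init__(self, source):
        self.source = source

    def boundaries_2d(self):
        # BOUNDARIES: divide the original bounds by 2.
        n_rows, n_cols = self.source.boundaries_2d()
        return (n_rows // 2, n_cols // 2)

    def position_2d(self, row, col):
        # POSITION: map a reduced location back to the source location.
        return (2 * row, 2 * col)

    def value_2d(self, row, col):
        # VALUE: average the 2 by 2 neighborhood in the source.
        r, c = self.position_2d(row, col)
        total = sum(self.source.value_2d(r + dr, c + dc)
                    for dr in (0, 1) for dc in (0, 1))
        return total / 4


source = GridImage([[1, 3, 5, 7],
                    [1, 3, 5, 7],
                    [2, 2, 8, 8],
                    [2, 2, 8, 8]])
half = HalfSize(source)
reduced = [[half.value_2d(r, c) for c in range(2)] for r in range(2)]
```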

Virtual curves and regions can be created. The
VALUE message can easily be mapped as described
above. A curve can be zoomed by any factor by
appropriately modifying the BOUNDARIES and POSITION
messages. A virtual curve can also be created
that evenly redistributes the points in a source curve in
fixed distance increments, or in a fixed number of total
points. Multiple curves can be attached end to end by
modifying the BOUNDARIES message and dispatching
to the appropriate source curve in the POSITION and
VALUE messages.

A  function  to  help  the  user  create  virtual  objects 
(called  MVO)  is  included  in  the  environment.  It  has 
many  options,  some  of  which  are  described  below: 

1) PMF (Pixel Mapping Function): This option
is used for virtual objects that maintain the
spatial attributes of their source objects, but
modify the returned value. The user supplies
<n> source objects and a lambda expression
of <n> values. The VALUE message in the
newly created virtual object has the effect of
applying the lambda expression to corresponding
pixels in each of the source objects. For
example:

(mvo :pmf '(lambda (p1) (logxor -1 p1)) IMAGE)

yields

:1D-VALUE =
'(lambda (index)
   (logxor -1
           (send IMAGE :1D-VALUE index)))

PMF also understands some keyword
options that produce the lambda expressions
that are then incorporated into the virtual
object. An example of such a usage is:

(mvo :pmf '(:scale 1000 2000 250 0) IMAGE)

which scales image values 1000 to 2000 into the
range 250 to 0. The actual lambda expression
that does the scaling is generated inside of
MVO.




2) NMF (Neighborhood Mapping Function):
This option is used for virtual objects that are
neighborhood maps over a single source
object. The user supplies a neighborhood size,
an optional reduction amount, and a lambda
expression of <n> args where <n> is the
neighborhood size. The output object is
reduced from the source by the reduction factor
and applies the neighborhood function to
the appropriate region to supply the value in
the virtual object. A horizontal gradient
example is specified as:

(mvo :NMF '(minus :neighborhood (1 2)) IMAGE)

The neighborhood average example is
expressed as:

(mvo :NMF '(avg :neighborhood (2 2)
                :reduction (2 2)) IMAGE)

3) CONNECT: This option builds a virtual
curve that connects any number of source
curves as described above. It is created as follows:

(mvo :connect '(:wrap-around t) C1 C2 C3)

The :wrap-around option indicates that the
last point should be connected to the first
point.
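
Two of the options above can be sketched in Python: the linear mapping that a :scale keyword would have to generate, and a connected virtual curve. This is not the MVO implementation. The names scale_fn and ConnectedCurve are hypothetical, the scaling is assumed to be linear, and wrap-around is modeled here, as an assumption, by repeating the first point after the last one.

```python
def scale_fn(in_low, in_high, out_low, out_high):
    """Return a function mapping [in_low, in_high] linearly onto
    [out_low, out_high]."""
    def scale(value):
        fraction = (value - in_low) / (in_high - in_low)
        return out_low + fraction * (out_high - out_low)
    return scale


class ConnectedCurve:
    """Virtual curve joining source curves end to end.

    Each source curve is a list of (x, y, color) triplets.
    """

    def __init__(self, curves, wrap_around=False):
        self.curves = curves
        self.wrap_around = wrap_around

    def boundaries_1d(self):
        total = sum(len(curve) for curve in self.curves)
        return total + 1 if self.wrap_around else total

    def _locate(self, n):
        if n < 0 or n >= self.boundaries_1d():
            raise IndexError(n)
        if self.wrap_around and n == self.boundaries_1d() - 1:
            n = 0  # the closing point is the first point again
        # Dispatch to the source curve that owns point n.
        for curve in self.curves:
            if n < len(curve):
                return curve[n]
            n -= len(curve)

    def position_1d(self, n):
        x, y, _ = self._locate(n)
        return (x, y)

    def value_1d(self, n):
        return self._locate(n)[2]


scale = scale_fn(1000, 2000, 250, 0)

c1 = [(0, 0, 1), (1, 0, 2)]
c2 = [(2, 0, 3)]
c3 = [(3, 1, 4), (4, 1, 5)]
closed = ConnectedCurve([c1, c2, c3], wrap_around=True)
```

Neither object copies its sources; the connected curve only redirects each message to the curve that owns the requested point.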

3.  PROGRAMMING  CONSTRUCTS 


Our preferred base language is COMMON LISP
because  it  has  the  flexibility  of  any  LISP  dialect  and  is 
a  standard  language  available  on  a  wide  variety  of 
hardware.  In  addition  to  the  base  language,  constructs 
are  needed  for  expressing  common  image  understanding 
idioms  in  a  concise,  efficient,  and  portable  manner. 
Having  such  programming  constructs  maximizes  the 
productivity  of  a  researcher  by  allowing  an  algorithm 
to  be  expressed  at  a  conceptual  level  without  concern 
for  its  efficient  implementation  on  a  specific  machine. 
The  details  of  generating  the  efficient  code  required  by 
image  understanding  programs  can  be  left  to  an  expert 
on  a  particular  machine. 

Programming  constructs  for  image  understanding 
have  two  conflicting  requirements:  to  provide  as  much 
generality  and  abstraction  as  possible  while  running  as 
efficiently  as  possible.  Many  programming  environ¬ 
ments  provide  a  variety  of  different  constructs  to  write 
efficient  code  on  a  particular  machine.  The  problem 
with  this  approach  is  that  the  code  is  no  longer  port¬ 
able  and  requires  full  knowledge  of  the  tradeoffs  on  a 
particular  machine.  The  approach  we  have  taken  is  to 
use  macros  to  create  higher  level  language  constructs. 
These  are  then  compiled  into  code  optimized  for  the 
way  that  a  construct  is  being  used  and  the  require¬ 
ments  of  a  particular  machine.  This  allows  the 
specification  of  algorithms  in  a  concise  and  portable 
fashion,  while  still  hiding  the  details  of  efficient  imple¬ 
mentation. 


The  two  general  classes  of  image  understanding 
programming  constructs  are  object  constructs  and 
expansion  environments.  Object  constructs  provide 
well  defined  ways  to  create,  access  and  modify  objects. 
They  hide  the  implementation  of  the  objects.  Expan¬ 
sion  environments  provide  the  context  and  binding 
required  to  make  repetitive  object  operations  more 
efficient. 

A  typical  expansion  environment  is  one  that 
specifies  an  iteration  over  neighborhoods  in  an  image, 
storing  the  result  of  applying  some  function  over  that 
neighborhood  into  another  image.  The  construct  pro¬ 
vides  defaults  that  can  be  overridden  for  figuring  the 
bounds  of  looping  and  boundary  checking.  The  image 
access  constructs  inside  the  environment  expand  into 
code  that  ’knows’  that  accesses  are  being  done  in  neigh¬ 
borhoods  and  what  sort  of  boundary  checking  is 
required. Other expansion environments provide standard
feature interface and error checking or recovery
code.

One  specific  expansion  environment  that  we  have 
developed  is  called  DEFIU.  It  provides  standard  key¬ 
word  arguments,  error  checking  and  recovery  code, 
database  management  code,  hooks  for  interactive  func¬ 
tion  application,  boundary  checking,  and  efficient  loop 
generation.  It  contains  knowledge  about  providing 
defaults  and  efficiently  implementing  common  image 
processing  idioms.  All  of  its  features  are  controlled 
through keywords. It assembles a program from context
information and code fragments. DEFIU has
proved to be very useful in writing general image
processing programs. As an example of the use of DEFIU,
see Figure 3-1, which shows the definition of chamfer. Each
line of this function maps into approximately 20 lines
of more primitive Lisp code.


;;; Generate a distance map for a binary image by chamfering.
(defiu chamfer (binary-image)
  "Take a binary image and compute the distance at each point in the
   grid from the nearest object."
  menu curve
  input (binary-image)
  output (chamfer-image :array-type 'art-q)
  do (pset (if (= (pget binary-image) 0) 1000000000.0
               0.0)
           chamfer-image)
  in 3 by 3
  centerr 1 centerc 1
  do (pset (min (pget-r chamfer-image 0 0)
                (+ 2.0 (pget-r chamfer-image 0 -1))
                (+ 3.0 (pget-r chamfer-image -1 -1))
                (+ 2.0 (pget-r chamfer-image -1 0))
                (+ 3.0 (pget-r chamfer-image -1 1)))
           chamfer-image)
  cdirection - rdirection -
  do (pset (min (pget-r chamfer-image 0 0)
                (+ 2.0 (pget-r chamfer-image 0 1))
                (+ 3.0 (pget-r chamfer-image 1 1))
                (+ 2.0 (pget-r chamfer-image 1 0))
                (+ 3.0 (pget-r chamfer-image 1 -1)))
           chamfer-image))


Figure  3-1:  DEFIU 
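
For readers unfamiliar with chamfering, the loops that such a definition stands for can be written out by hand. The Python sketch below is not the code DEFIU generates; it is a plain two-pass chamfer under the assumptions that horizontal and vertical steps cost 2.0, diagonal steps cost 3.0, nonzero pixels are objects, and references outside the image behave as a very large distance.

```python
FAR = 1000000000.0


def chamfer(binary_image):
    """Distance from each pixel to the nearest nonzero pixel."""
    n_rows, n_cols = len(binary_image), len(binary_image[0])
    dist = [[FAR if binary_image[r][c] == 0 else 0.0
             for c in range(n_cols)] for r in range(n_rows)]

    def get(r, c):
        # References outside the image act as "very far away".
        if 0 <= r < n_rows and 0 <= c < n_cols:
            return dist[r][c]
        return FAR

    # Forward pass: top to bottom, left to right.
    for r in range(n_rows):
        for c in range(n_cols):
            dist[r][c] = min(dist[r][c],
                             2.0 + get(r, c - 1),
                             3.0 + get(r - 1, c - 1),
                             2.0 + get(r - 1, c),
                             3.0 + get(r - 1, c + 1))

    # Backward pass: bottom to top, right to left.
    for r in reversed(range(n_rows)):
        for c in reversed(range(n_cols)):
            dist[r][c] = min(dist[r][c],
                             2.0 + get(r, c + 1),
                             3.0 + get(r + 1, c + 1),
                             2.0 + get(r + 1, c),
                             3.0 + get(r + 1, c - 1))
    return dist


distance_map = chamfer([[0, 0, 0],
                        [0, 1, 0],
                        [0, 0, 0]])
```

Written this way, the bounds handling and the two loop nests dominate the text, which is the bookkeeping that an expansion environment is meant to absorb.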




One of the most important of the arguments to a
DEFIU function is the Function Application Mask
(FAM) keyword. The FAM is a function that can be
passed into a DEFIU function to determine whether
that function is applied to each individual pixel. This
mechanism can be used to restrict the application of a
function to a particular part of an image based on the
row and column index, a mask image, the pixel’s value,
or any test that can be done with the full flexibility of
LISP.
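
The idea of a function application mask can be sketched in Python. The function name apply_with_mask, the (row, col, value) signature of the mask, and the choice to leave unselected pixels unchanged are all assumptions made for illustration.

```python
def apply_with_mask(image, function, fam=None):
    """Apply function to each pixel for which fam(row, col, value) is true.

    Pixels rejected by the mask are copied through unchanged.
    """
    result = [row[:] for row in image]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if fam is None or fam(r, c, value):
                result[r][c] = function(value)
    return result


image = [[1, 2],
         [3, 4]]

def times_ten(value):
    return value * 10

# Mask on the pixel's value.
bright_only = apply_with_mask(image, times_ten,
                              fam=lambda r, c, v: v > 2)

# Mask on the row index.
top_row_only = apply_with_mask(image, times_ten,
                               fam=lambda r, c, v: r == 0)

# Mask taken from a separate mask image.
mask_image = [[0, 1],
              [1, 0]]
masked = apply_with_mask(image, times_ten,
                         fam=lambda r, c, v: mask_image[r][c] == 1)
```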

DEFIU provides error checking and recovery facilities
to ensure that objects are of the appropriate type
and that any objects created are destroyed if the function
has an error or is aborted. This sort of error
recovery is important when writing functions, because it ensures
that the environment is not filled with objects that
have corrupted data.
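
The recovery behavior can be pictured with a short Python sketch. The names are hypothetical and the database is modeled as a plain list; the sketch only assumes what the paragraph states, that objects created by a function are removed again if the function fails or is aborted.

```python
def run_with_cleanup(database, function, *args):
    """Run function; if it fails, remove every object it created."""
    created = []

    def create(obj):
        # Every object the function creates is registered here.
        created.append(obj)
        database.append(obj)
        return obj

    try:
        return function(create, *args)
    except BaseException:
        # Covers errors and aborts alike: drop the partial results.
        database[:] = [obj for obj in database
                       if not any(obj is c for c in created)]
        raise


database = []

def halve(create, values):
    return create([v / 2 for v in values])

def broken(create, values):
    create([0] * len(values))      # a partially filled result
    raise ValueError("bad input")

kept = run_with_cleanup(database, halve, [2, 4])

try:
    run_with_cleanup(database, broken, [1, 2])
    failed = False
except ValueError:
    failed = True
```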

The  default  boundary  checking  in  DEFIU  ensures 
that  no  reference  will  ever  be  outside  the  bounds  of  an 
object.  This  default  can  be  overridden  on  each  image 
individually,  so  a  constant  value  will  be  returned  by 
any  reference  outside  the  image.  Specifying  boundary 
checking  once  for  an  object,  rather  than  at  every 
access,  has  proven  to  be  a  very  useful  feature. 
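
The two boundary policies can be contrasted in a small Python sketch of a 3 by 3 neighborhood sum. The function name and its outside_value parameter are hypothetical; the sketch assumes that the default policy restricts the loop to pixels whose whole neighborhood lies inside the image, and that the override supplies a constant for references outside it.

```python
def neighborhood_sums(image, outside_value=None):
    """Sum each 3 by 3 neighborhood under one of two boundary policies."""
    n_rows, n_cols = len(image), len(image[0])

    def get(r, c):
        if 0 <= r < n_rows and 0 <= c < n_cols:
            return image[r][c]
        return outside_value

    if outside_value is None:
        # Default: loop bounds are shrunk so no reference leaves the image.
        rows, cols = range(1, n_rows - 1), range(1, n_cols - 1)
    else:
        # Override: visit every pixel; outside references give a constant.
        rows, cols = range(n_rows), range(n_cols)

    return {(r, c): sum(get(r + dr, c + dc)
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            for r in rows for c in cols}


ones = [[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]]
interior_only = neighborhood_sums(ones)
padded = neighborhood_sums(ones, outside_value=0)
```

The policy is stated once per image, and the access function get carries it to every reference.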


4.  SYSTEM  DATABASES  AND  BROWSERS 


Several databases have been developed that
automatically  record  objects  and  their  interactions. 
This  provides  a  mechanism  for  storing  and  finding 
objects  and  their  relations.  Four  different  databases 
have  been  developed:  the  Perceptual  Structures 
Database  (PSDB),  the  Environmental  Database  (EDB), 
the  Long  Term  Database  (LTDB),  and  the  Program¬ 
mers  Database  (PDB).  These  databases  are  imple¬ 
mented  as  lists  of  objects,  so  that  additional  databases 
can  be  easily  created  by  an  application.  The  PDB  and 
the  LTDB  are  maintained  in  disk  files  that  are 
automatically  read  when  the  system  is  started.  All  of 
these  databases  are  accessible  through  a  common  and 
extensible  query  language  and  interactive  browsers. 


4.1  PERCEPTUAL  STRUCTURES  DATABASE 

The  Perceptual  Structures  Database  (PSDB)  con¬ 
tains  all  of  the  globally  accessible  perceptual  structures 
generated  by  an  application.  The  perceptual  structures 
include  points,  edges  and  regions,  as  well  as  recursive, 
structure-based  groupings  of  these.  The  database  has 
two  primary  purposes:  to  make  an  image  understand¬ 
ing  environment  easier  to  use  by  providing  a  central 
place  to  find  perceptual  objects,  and  to  provide  a 
means  of  communication  between  different  parts  of  an 
application.  With  this  database,  and  tools  for  explor¬ 
ing  it  such  as  filters  and  browsers,  a  user  can  refer  to 
objects  by  referring  to  their  characteristics  as  well  as 
their  names. 


The  PSDB  contains  images  and  all  of  the  objects 
extracted  from  them.  Because  of  its  potentially  large 
size,  the  PSDB  is  organized  hierarchically,  with  only 
large-grained  objects,  such  as  images,  stacks,  pyramids, 
and  groups,  at  the  top  level.  This  has  two  advantages: 
it  combines  the  information  in  the  database  to  make  it 
easier  to  use  and  it  makes  queries  more  efficient  by 
quickly  ruling  out  potential  matches  for  a  query.  The 
general  combining  mechanism  is  a  construct  called  a 
group  that  contains  a  list  of  objects.  A  group  inherits 
some  of  the  properties  of  its  contents  and  can  be  used 
with  all  of  the  database  tools.  Each  top-level  entry  in 
the  PSDB  has  slots  for  time  of  creation,  parents,  chil¬ 
dren,  name  and  number.  The  name  slot  contains  the 
name  of  a  global  variable  that  is  bound  to  that  object 
and  the  number  slot  contains  a  number  that  refers  to 
that  object  as  well.  The  parents  slot  and  the  children 
slot  are  used  to  make  connections  to  the  entries  in  the 
EDB.  Through  these  connections,  a  user  can  look  at 
the  complete  processing  history  of  an  object. 

4.2  ENVIRONMENTAL  DATABASE 

The  Environmental  Database  (EDB)  records  func¬ 
tion  calls,  their  input  parameters,  and  their  results. 
Each  record  in  the  EDB  consists  of  a  time  stamp,  the 
name  of  the  function  applied,  the  parameters  supplied 
to  the  function,  the  input  Perceptual  Structure  Data¬ 
base  Objects  (PSDBOs),  the  output  PSDBOs,  notes 
included  by  the  function  writer,  notes  added  by  the 
function  caller,  and  the  name  of  the  person  that  called 
the  function.  When  an  entry  is  made  in  the  EDB,  that 
entry  is  added  to  the  children  of  each  of  the  input 
PSDBOs  and  to  the  parent  of  each  of  the  output 
PSDBOs. In the example in Figure 4-1, the EDB records
generated by a call first to vgradient, and then to
gradient-non-maxima-suppression are shown.
Gradient-non-maxima-suppression-vector, called by
gradient-non-maxima-suppression, actually generated
the returned result. By querying the EDB, all of
the PSDBOs generated by vgradient or all records that
have a specific note attached to them can be found.
Since PSDBOs are linked to records in the EDB, the
complete processing history of an object can be shown.
This  history  is  useful  to  track  down  why  an  error 
occurred.  The  availability  of  object  histories  allows 
both  manual  and  automatic  storage  management. 
Since  objects  in  the  PSDB  are  potentially  large  and 
cannot  be  reclaimed  by  normal  LISP  garbage  collection 
while  they  are  in  a  database,  it  is  necessary  to  be  able 
to  remove  objects  from  the  PSDB.  This  creates  a  prob¬ 
lem  for  the  EDB  because  the  pointers  to  deleted  PSDB 
objects  must  be  removed  from  function  records.  This  is 
handled  by  replacing  pointers  to  a  deleted  PSDB  object 
with  the  object's  parent  EDB  record.  In  this  way, 
space  is  traded  for  time.  The  exact  result  can  be  regen¬ 
erated  by  using  the  parent  EDB  record  to  call  the  same 
function  with  the  same  parameters. 
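
The record keeping and regeneration described above can be sketched as follows. This is a minimal illustration in Python rather than the system's ZetaLisp, and every name in it (`PSDBObject`, `EDBRecord`, `call_and_record`, `delete_object`, `regenerate`) is a hypothetical stand-in, not the authors' interface.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class PSDBObject:
    """Hypothetical stand-in for a Perceptual Structure Database Object."""
    value: Any
    parent: Optional["EDBRecord"] = None                        # record that produced it
    children: List["EDBRecord"] = field(default_factory=list)   # records that consumed it


@dataclass
class EDBRecord:
    """One EDB entry: which function was applied to what, and what came out."""
    timestamp: float
    function: Callable[..., Any]
    params: dict
    inputs: List[Any]    # PSDBObjects, or parent EDBRecords of deleted objects
    outputs: List[Any]
    caller: str
    notes: str = ""


EDB: List[EDBRecord] = []


def call_and_record(fn, inputs, caller, notes="", **params):
    """Apply fn to the input objects and log the call in the EDB."""
    result = PSDBObject(fn(*[obj.value for obj in inputs], **params))
    record = EDBRecord(time.time(), fn, params, list(inputs), [result], caller, notes)
    for obj in inputs:
        obj.children.append(record)   # the entry joins the children of each input
    result.parent = record            # and becomes the parent of each output
    EDB.append(record)
    return result


def delete_object(obj):
    """Drop a PSDB object, replacing pointers to it with its parent record."""
    if obj.parent is None:
        raise ValueError("object has no processing history to fall back on")
    for record in EDB:
        record.inputs = [obj.parent if x is obj else x for x in record.inputs]
        record.outputs = [obj.parent if x is obj else x for x in record.outputs]
    obj.value = None  # the (potentially large) data can now be reclaimed


def regenerate(record):
    """Recompute a result by re-running the recorded call with the same parameters."""
    values = [regenerate(x) if isinstance(x, EDBRecord) else x.value
              for x in record.inputs]
    return record.function(*values, **record.params)
```

Deleting an intermediate object and later calling `regenerate` on its parent record recomputes the value on demand, which is the space-for-time trade mentioned in the text.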




[Figure 4-1 content, largely illegible in the source: three EDB records.
(1) VGRADIENT, whose output is a gradient image.
(2) GRADIENT-NON-MAXIMA-SUPPRESSION, which takes that gradient image as input (parameters include USE-GRADIENT-MAGNITUDE NIL) and lists two output images.
(3) GRADIENT-NON-MAXIMA-SUPPRESSION-VECTOR, which takes the same gradient image (parameters include FAM NIL and LEVEL 0) and lists the same two output images.]


Figure  4-1:  Environmental  Database  Object 


4.3  LONG  TERM  DATABASE 

The  Long  Term  Database  (LTDB)  contains 
descriptions of all permanent PSDB objects. When
developing an application, the LTDB helps manage
the large number of objects that are stored. Having a
large  number  of  images  and  preprocessed  results  avail¬ 
able  saves  time  in  testing  image  understanding  routines 
and  makes  it  possible  to  test  a  routine  on  a  variety  of 
data  sets.  Each  entry  has  a  name,  type,  creation-date, 
source,  documentation,  directory,  files,  keywords  and 
history  slot. 


4.4  PROGRAMMERS  DATABASE 


#<PDB IUL:VGRADIENT>

Type: DEFUN
Name: IUL:VGRADIENT
Parameters: (IMG)

Documentation: "Calculate the gradient at each point in an image using 2 by 2 neighborhoods"

Author: CHRIS

Date: [illegible in source]

File: IUL: RMOD; IMAGEFUNS.BIN.NEWEST

Keywords: (UTILITY "Calculate the gradient at each point in an image using 2 by 2
neighborhoods" CHRIS IUL: RMOD; IMAGEFUNS.BIN.NEWEST IUL:VGRADIENT)

Figure 4-2: Programmers Database Object


4.5  FILTER  FUNCTIONS 

A  query  language  is  used  to  access  the  databases. 
The  language  is  designed  with  the  same  kind  of  modu¬ 
larity  and  combination  that  is  found  in  Unix  pipes,  but 
with  more  flexibility.  The  language  is  based  on  the 
combination  of  filter  functions  into  filters. 

A filter is a macro that generates LISP from filter
functions and the logical combiners AND, OR, and NOT.
A filter function takes a list as its first parameter, and,
together with optional additional parameters, returns a
list as its result. The logical combiners specify how the
results of filter functions are piped into other filter
functions. This includes the generation of temporary
bindings and set union (OR) or difference (NOT) code.
Filters are efficient because they expand into optimized,
though harder to understand, LISP code. They
are very flexible because the only convention is that the
first parameter and the results are lists. There are no
constraints on what the elements of the list are, although
they  are  usually  objects.  What  happens  in  a  filter 
function  can  range  from  simple  attribute  checking  to  a 
complex  search  or  pattern  matching  operation.  The 
objects  returned  are  usually  a  subset  of  the  original 
objects,  but  can  be  a  superset  or  even  a  completely 
different  set  of  objects.  There  are  three  different  gen¬ 
eral  classes  of  filter  functions:  selectors,  transformers, 
and  modifiers. 
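
The list-in, list-out convention and the three combiners can be illustrated with a small sketch. It is written in Python and works by composing functions at run time, whereas the system described here expands a macro into LISP code; the names `and_`, `or_`, `not_`, and `filter_` are hypothetical, and the implicit AND between top-level clauses is an assumption drawn from the example filters in this section.

```python
from typing import Callable

FilterFn = Callable[[list], list]   # a filter function: list in, list out


def and_(*fns: FilterFn) -> FilterFn:
    """Pipe the list through each filter function in turn."""
    def run(items):
        for fn in fns:
            items = fn(items)
        return items
    return run


def or_(*fns: FilterFn) -> FilterFn:
    """Set union of the results, each function applied to the same input."""
    def run(items):
        out = []
        for fn in fns:
            for x in fn(items):
                if x not in out:
                    out.append(x)
        return out
    return run


def not_(fn: FilterFn) -> FilterFn:
    """Set difference: everything in the input that fn does not return."""
    def run(items):
        rejected = fn(items)
        return [x for x in items if x not in rejected]
    return run


def filter_(items, *fns: FilterFn) -> list:
    """Top-level form: the clauses are applied one after another (implicit AND)."""
    return and_(*fns)(list(items))
```

For instance, `filter_(range(10), big, not_(evens))` first keeps the large values and then removes the even ones from that intermediate list.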


The  Programmers  Database  (PDB)  automates  the 
management  of  function  and  object  definitions.  It 
enables  a  user  to  browse  through  the  code  generated  by 
others  to  avoid  duplicating  work.  This  is  particularly 
important  where  the  environment  is  being  used  by 
many  people  working  on  different  applications  who  are 
not  directly  communicating  with  each  other. 

Automatic  maintenance  of  the  PDB  is  important 
because  users  might  not  spend  the  effort  needed  to 
maintain  it  manually.  Entries  in  the  PDB  are  automat¬ 
ically  updated  each  time  a  file  is  compiled.  It  is  also 
possible  to  generate  documentation  from  the  documen¬ 
tation  in  files. 

Entries  in  the  PDB  have  slots  for  the  type,  name, 
parameters,  documentation,  file,  author,  and  date  of 
last  modification.  Figure  4-2  shows  the  PDB  entry  for 
the  vgradient  function.  Using  the  information  in  these 
fields,  it  is  easy  to  use  the  text  browser  to  find  code 
with  specific  properties. 


“Selectors” produce a subset of the original list as
their result. This is the most common type of filter
function. “Transformers” take in a list and produce a
list of completely different objects. “Modifiers” change
some aspect of the objects and then return the objects
as their result. Side effects in a query can be very useful.
For example, long edges can be chosen, their orientation
computed and stored, and then only the horizontal
edges selected.

Because elements in the lists are usually objects,
there are a number of generic filter functions in the
core system. Some of these are rangef, memberf,
stringf, and timef. “Rangef” selects objects that have
a slot value between two other values (or equal to the
first if there is only one parameter). “Memberf” selects
objects with a slot value that is a member of its second
parameter. “Stringf” selects objects that have its
second parameter as a substring of the object's slot
value. “Timef” selects objects that have a slot value
between times. “Collectf” is a transformer that collects
all of the objects' slot values into a list and then
returns the list. There are also a number of special-purpose
filter functions for specific applications. These
functions can be combined and used in many different
situations.
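
As an illustration of the generic filter functions just described, the sketch below gives Python versions of three selectors and one transformer. The `Entry` class and its slots are invented for the example, and treating the range in `rangef` as inclusive is an assumption; the text does not say whether the bounds are included.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical database object with a few slots."""
    author: str
    keywords: str
    length: float


def rangef(items, slot, low, high=None):
    """Selector: slot value in [low, high], or equal to low if high is omitted."""
    if high is None:
        return [x for x in items if getattr(x, slot) == low]
    return [x for x in items if low <= getattr(x, slot) <= high]


def memberf(items, slot, allowed):
    """Selector: slot value is a member of `allowed`."""
    return [x for x in items if getattr(x, slot) in allowed]


def stringf(items, slot, needle):
    """Selector: `needle` is a substring of the slot value."""
    return [x for x in items if needle in getattr(x, slot)]


def collectf(items, slot):
    """Transformer: return the slot values themselves, not the objects."""
    return [getattr(x, slot) for x in items]
```

Because each function takes a list first and returns a list, calls nest directly, for example `collectf(stringf(memberf(db, "author", ["chris", "phil"]), "keywords", "edge"), "author")`.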




Some  example  filters  include: 

(filter *pdb*
    (memberf :author '(chris phil))
    (timef :date "Jan 1")
    (or (stringf :keywords "edge")
        (stringf :keywords "curve")))

Return all entries in the Programmers
Database that were written by chris or phil
after January 1 of this year that have
something to do with edges or curves.

(filter 13
    (near-object *curve* 10)
    (within-filter :avg-intensity 3))

Return all curves in group 13 (an application
specific database) that are within 10 pixels of the
curve's position in its label plane and are within
3 of the object's average intensity.


[Figure 4-3 content, largely illegible in the source: a text browser listing of PDB entries. Each row shows a definition type (DEFUN), a package-qualified function name from the IUL: or GROUPER: packages, and the beginning of its documentation string, for example "Extract curves ...", "Extract junctions from ...", "Walk a curve ...", "Find the next point on a ...", and "Deposit the spines ...".]

Figure  4-3:  Text  Browser 


4.6  BROWSERS 

Browsers  are  an  interactive  query  mechanism  for 
finding  and  exploring  relationships  between  objects. 
Three  different  types  of  browsers  have  been  designed 
and  partially  implemented:  a  general  text-oriented 
browser,  an  image  browser  and  a  graph  browser.  All  of 
the browsers maximize the breadth or depth of
information presented at one time and provide a
means  to  interactively  apply  functions  to  selected 
objects.  They  follow  the  filter  function  convention  of 
using  lists  as  their  first  parameter  and  of  producing 
lists  as  results,  so  they  can  be  used  in  the  midst  of  a 
filter. 


4.6.1  Text  Browser 

The  text-oriented  browser  works  with  any  object 
and  provides  an  easy  interface  to  interactively  apply 
functions. There are two different display modes. The
"object/line" mode presents each object on a single line
of text divided into fields and maximizes the breadth of
information that can be seen at one time. The
"field/line" mode displays one object at a time with
one  field  per  line,  maximizing  the  depth  of  information 
displayed  about  that  one  object.  The  object/line 
display  is  useful  when  looking  for  a  specific  set  of 
objects.  As  the  set  of  objects  is  narrowed  down,  the  set 
of  objects  that  meet  selected  specifications  are  shown. 
The  field/line  display  is  useful  when  stepping  through  a 
subset  of  the  objects  in  a  database,  looking  at  each 
object  in  detail.  Figure  4-3  shows  the  text  browser 
being applied to the PDB.

4.6.2  Image  Browser 

The  image  browser  displays  reduced  resolution 
pictures  of  objects.  Usually  a  text  browser  or  a  filter  is 
used  to  narrow  down  the  number  of  images  to  browse. 
The  image  browser  can  be  used  to  scroll  through  the 
pictures,  look  at  their  entries  in  the  LTDB  and  load  the 
objects.  The  pictures  are  generated  automatically  on  a 
request  to  browse  a  specific  object  and  then  stored  so 
that  if  that  object  is  browsed  again,  the  browser  can 
use  the  previously  generated  picture. 
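
The generate-once, reuse-afterwards behavior of the image browser can be sketched in a few lines of Python. This is only an illustration: the in-memory dictionary stands in for whatever storage the system uses, simple subsampling stands in for its reduction method, and `thumbnail` and `reduce_resolution` are hypothetical names.

```python
_thumbnail_cache = {}


def reduce_resolution(image, factor):
    """Subsample a list-of-rows image by keeping every factor-th pixel."""
    return [row[::factor] for row in image[::factor]]


def thumbnail(name, load_image, factor=4):
    """Return the reduced picture for an object, generating it on first request only."""
    key = (name, factor)
    if key not in _thumbnail_cache:
        _thumbnail_cache[key] = reduce_resolution(load_image(name), factor)
    return _thumbnail_cache[key]
```

The full-resolution object is loaded only on the first request; later requests for the same object are served from the stored picture.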


4.6.3  Graph  Browser 

The  graph  browser  displays  objects  and  the  rela¬ 
tionships  between  them.  In  the  same  way  that  the 
image  browser  shows  what  a  stored  object  looks  like 
more  clearly  than  its  textual  description,  the  graph 
browser  shows  the  connections  between  objects  more 
clearly.  It  can  be  used  to  look  at  the  relationships 
between  objects  in  the  PSDB  and  EDB,  or  to  trace  the 
processing  flow  through  object  models. 


5. DISPLAY SUBSYSTEM


The  display  system  is  built  around  the  family  of 
messages  described  in  the  section  on  object  access. 
Each routine contains two main pieces: an iteration
scheme  for  scanning  and  accessing  the  object  using  the 
basic  messages,  and  the  code  needed  to  display  the 
object  on  the  screen.  Because  all  objects  can  be 
accessed  the  same  way,  a  small  set  of  display  functions 
can  be  used  on  a  wide  variety  of  objects.  The  use  of 
virtual objects enhances and simplifies the display system.
Rather than having options to scale, reduce, or
zoom images (etc.) in the display routine, a virtual
object  can  be  created  that  captures  the  desired  func¬ 
tionality.  The  display  function  can  concentrate  simply 
on  presenting  an  object. 
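
A short sketch may make the division of labor concrete. Here a reduced-resolution view is a virtual object that answers the same access messages as a stored image, so the display routine needs no scaling option. The sketch is in Python, and the class and method names (`Image`, `Reduced`, `shape`, `value`, `display`) are assumptions made for the example, not the system's message names.

```python
class Image:
    """Concrete 2D object answering the basic access messages used below."""
    def __init__(self, rows):
        self.rows = rows

    def shape(self):
        return len(self.rows), len(self.rows[0])

    def value(self, r, c):
        return self.rows[r][c]


class Reduced:
    """Virtual object: looks like an image, computes values from its source on demand."""
    def __init__(self, source, factor):
        self.source, self.factor = source, factor

    def shape(self):
        h, w = self.source.shape()
        f = self.factor
        return (h + f - 1) // f, (w + f - 1) // f

    def value(self, r, c):
        return self.source.value(r * self.factor, c * self.factor)


def display(obj, plot):
    """Generic pixel display: iterate with the access messages, plot each value."""
    h, w = obj.shape()
    for r in range(h):
        for c in range(w):
            plot(r, c, obj.value(r, c))
```

`display` treats `Image` and `Reduced` identically, which is the sense in which the display function can concentrate on presenting an object.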

The  user  interacts  with  the  display  system 
through  a  single  function  (the  "D"  function)  as  shown 
in  Figure  5-1.  This  function  plays  two  roles.  It  han¬ 
dles  many  of  the  tedious  details  associated  with  window 
management,  and  it  provides  a  uniform  syntax  for  issu¬ 
ing  display  commands.  The  current  pan,  zoom,  offset, 
sub-image  and  sampling  rate  for  each  window  are 
stored  in  an  internal  database  and  can  be  used  as  the 
default  in  subsequent  displays  (for  displays  that  sup¬ 
port  these  features).  Windows  can  be  assigned  as 
viewports  on  other  windows  and  are  automatically 




In  general,  a  display  is  colored  based  on  the  value 
of  the  underlying  object.  However,  this  default  can  be 
overridden  by  over-coloring,  another  display  option. 
For  example,  a  mountain  of  elevation  200  can  be 
displayed  in  color  200  in  a  standard  call  to  the  surface 
display.  However,  it  is  often  desirable  to  guide  a 
display  with  one  set  of  objects,  and  color  it  with 
another.  A  surface  display  may  use  a  terrain  image  as 
the  source  of  elevation  data,  and  a  feature  image  to 
color  each  point.  Another  example  is  to  use  a  horizon¬ 
tal  gradient  and  vertical  gradient  image  to  display  a 
vector  field  but  use  the  absolute  magnitude  of  the  pixel 
underlying  the  vector  as  the  color.  Both  these  tasks 
are straightforward using the over-color. Over-colors,
similar to FAMs, are objects in direct correspondence
with an object being displayed. Rather than serving as
a binary mask, an over-color is used as the color indicator.
Over-colors can also be fixed values (as opposed to objects) if
a monochrome display is desired.


1) Display Zero-Crossings in Red:

(d :edge (mvo :PMF (smooth-n image 2) (smooth-n image 4))
   :edge-function '(lambda (p1 p2) (not (= (sign p1) (sign p2))))
   :over-color red)


2) Contour plot every *contour* levels:

(d :edge image :edge-function '(lambda (p1 p2)
                                 (not (= (// p1 *contour*)
   [rest of this expression is missing in the source]


3) Vector display of the gradient colored by the magnitude:

(d :vector (mvo :mmf (#'- :neighborhood '(1 2) :reduction '(1 1)) image)
           (mvo :mmf (#'- :neighborhood '(2 1) :reduction '(1 1)) image)
   :over-color image)


4) Display a virtual closed curve composed of 3 connected smaller curves:

(d :curve (mvo :connect (:wrap-around t) curve1 curve2 curve3))


Figure  5-1:  Display  Examples 


The  following  display  routines  are  available  in  the 
initial  environment. 


•  Pixel: Basic display.

•  Edge: A pixel and its neighbors are submitted
to an edge function, and, if this function
returns non-zero or non-nil, the pixel is plotted.

•  Vector: Expects two images; the first specifies
the horizontal component of each vector and
the second the vertical component. Vectors
are based on imaginary grid lines through the
image. The grid position and size can be
specified by the user. FAMs can also be used
to mask off the grid.

•  Surface: Perspective display of an image interpreted
as elevation data. Appears as a warped
grid of lines. Position, elevation above the
ground, pan and tilt angle, maximum visibility,
field of view, and resolution can all be specified.
This routine optionally builds a visibility mask,
a binary array where pixels visible from the
current orientation are indicated.

•  Paint: Surface display with shading. This routine
optionally stores the inverse transform
from perspective coordinates back to image
coordinates for every point on the viewing surface.

•  Curve: Iterates as follows over its arguments:
For all 1D indices, at the 2D POSITION, plot
the 1D VALUE, connecting adjacent points.
Like the region display below, this routine is
optimized for particular objects, namely 1D
curves. Note that an image is also displayable
with this routine.

•  Region: Iterates as follows over its arguments:
For all rows and all columns in the 2D BOUNDARIES,
plot the 2D VALUE at (row, column).
This routine is optimized for region objects
stored as chain codes but also functions
correctly (if slowly) on curves and images.
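
The Edge routine in the list above can be illustrated with a zero-crossing edge function like the one in Figure 5-1. The sketch below is in Python and makes one assumption the text does not settle: each pixel is compared with its right and lower neighbors only. The names `edge_display` and `zero_crossing` are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def zero_crossing(p1, p2):
    """Edge function: true when two neighboring pixels differ in sign."""
    return sign(p1) != sign(p2)


def edge_display(image, edge_function):
    """Return the (row, col) pixels that the edge routine would plot."""
    plotted = []
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            # Assumed neighborhood: the pixel to the right and the pixel below.
            neighbors = []
            if c + 1 < cols:
                neighbors.append(image[r][c + 1])
            if r + 1 < rows:
                neighbors.append(image[r + 1][c])
            if any(edge_function(image[r][c], n) for n in neighbors):
                plotted.append((r, c))
    return plotted
```

Swapping in a different two-argument predicate, such as one comparing quantized contour levels, changes what is drawn without touching the iteration code.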


updated  as  the  main  window  is  changed.  All  display 
commands  are  stored  and  can  be  reexecuted  and  cycled 
on  a  window  by  window  basis.  Each  window  can  be 
assigned  a  set  of  colors  (or  dashed  line  patterns  on  a 
monochrome  display)  that  can  be  used  or  cycled  in  par¬ 
ticular  displays.  The  interface  function  also  checks  for 
errors  in  the  calling  parameters  of  a  particular  display 
routine  before  the  function  is  called,  and  can  route  a 
display  request  to  an  optimized  routine  when  appropri¬ 
ate. 

The  D  function  also  simplifies  the  implementation 
of  display  routines.  It  can  do  some  of  the  error  check¬ 
ing  normally  associated  with  a  display  function.  The 
interface  manages  the  task  of  defaulting  display  param¬ 
eters  for  each  window,  so  the  programmer  can  depend 
on  his  display  routine  being  called  with  actual,  valid 
values.  The  interface  is  responsible  for  queueing 
display  commands  and  handling  viewports,  removing 
the burden from the actual display routine. The
researcher  can  use  any  of  the  above  features  when 
installing  his  new  display  routine  by  characterizing  his 
program  in  any  of  the  several  internal  databases  that 
handle  display  requests. 

Displays can be modified to highlight selected
attributes through the use of Function Application
Masks (FAMs). A FAM is
an object in direct correspondence with the object
being displayed. It serves as a binary mask that determines
whether or not a particular data point should be
displayed. For example, in a 3D terrain display, a
FAM can be used to flatten a group of mountains and
see what's behind them by inhibiting the display of any
pixels corresponding to the range. Another example is
to use a FAM in a vector display so that only vectors
of a certain magnitude are allowed. Note that in this
case, the FAM object is actually more complicated than
the data objects (which supply the delta-X and delta-Y
values) because it must reference both data sources and
threshold their Cartesian length. Note that a FAM is
not needed for the first example (a virtual object can
be created that incorporates both the original terrain
elevation data and the terrain feature data, and zeros
the elevation wherever a mountain is indicated) but is
required in the second example, where no vector (as
opposed to a zero-length vector) is desired.
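
The vector-display example can be made concrete with a small Python sketch. It builds a FAM by thresholding the Cartesian length of each (delta-X, delta-Y) pair and optionally colors each surviving vector from a separate over-color object. The function names and the list-of-rows representation are assumptions for illustration only.

```python
import math


def magnitude_fam(dx, dy, threshold):
    """Binary mask: True where the (dx, dy) vector is at least `threshold` long."""
    return [[math.hypot(x, y) >= threshold for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(dx, dy)]


def vector_display(dx, dy, fam=None, over_color=None, default_color=1):
    """Return (row, col, dx, dy, color) for each vector that would be drawn."""
    drawn = []
    for r, (row_x, row_y) in enumerate(zip(dx, dy)):
        for c, (x, y) in enumerate(zip(row_x, row_y)):
            if fam is not None and not fam[r][c]:
                continue  # masked out: no vector at all, not a zero-length one
            color = over_color[r][c] if over_color is not None else default_color
            drawn.append((r, c, x, y, color))
    return drawn
```

The mask has to look at both data sources, which is why the text calls the FAM more complicated than the data objects it guards.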




There  is  also  a  routine  called  display-values  that 
allows  a  user  to  interactively  select  points  in  multiple 
objects  and  apply  arbitrary  functions  to  the  returned 
values. With this, a user can browse “inside” an object
description.  It  also  allows  the  user  to  manually  super¬ 
vise  localized  processing.  The  user  can  also  modify  the 
underlying  objects  based  on  data  he  can  interactively 
indicate  in  this  routine.  Display-values  is  especially 
useful  when  applied  to  label  planes  -  functions  can  be 
applied  to  user  selected  objects.  A  routine  for  interac¬ 
tively  segmenting  an  image  into  point,  curve  and  region 
objects  has  been  built  on  top  of  this  function. 


6. SUMMARY


The  variety  and  flexibility  of  this  computer  vision 
development  environment  allows  us  to  quickly  imple¬ 
ment  and  experiment  with  new  algorithms  and 
approaches.  The  combination  of  a  set  of  basic  extendi¬ 
ble  objects  with  uniform  interfaces,  virtual  objects,  sys¬ 
tem  databases,  interactive  browsers  and  a  powerful 
display  system  is  a  synergistic  one.  Each  piece 
amplifies  the  usefulness  of  the  others.  The  environment 
has  been  successfully  developed  and  used  by  researchers 
working  on  a  variety  of  problems.  They  have  found 
the  overall  environment  easy  to  use  and  powerful, 
although  there  are  still  areas  that  need  additional 
development.  Most  of  the  constructs  described  in  this 
paper  have  been  implemented  for  the  Symbolics  LISP 
machine  using  ZetaLisp  and  Flavors.  The  system  is 
currently being ported to a Common Lisp environment.

ACKNOWLEDGEMENTS 

The  authors  wish  to  thank  John  Dye,  Kaiti  Riley, 
Danny  Edelson,  Jay  Glicksman,  Angela  Erickson,  and 
Laura  McConnell. 

This  work  was  largely  supported  under  U.S. 
Government  contract  number  DACA76-85-C-0005  for 
the  U.S.  Army  Engineer  Topographic  Laboratories 
(ETL),  Fort  Belvoir,  Virginia,  and  the  Defense 
Advanced  Research  Projects  Agency  (DARPA),  Arling¬ 
ton,  Virginia. 

REFERENCES 

[Donaldson et al. - 83] - D. Donaldson, R. Kirby, J. Pallas,
and A. Rosenfeld, “UMIPS: University
of Maryland Image Processing
Software”, Computer Vision Laboratory,
Computer Science Center, Maryland,
December, 1983.

[Kohler et al. - 83] - R. Kohler, et al., “VISIONS:
Operating System Reference Manual”,
Computer and Information Science Dept.,
University of Massachusetts at Amherst,
1983.

[Quam - 84] - L. Quam, “IMAGECALC: Reference
Manual”, Stanford Research Institute,
Menlo Park, California, 1984.




INTERPRETING  AERIAL  PHOTOGRAPHS 
BY  SEGMENTATION  AND  SEARCH 


David  Harwood 
Susan  Chang 
Larry  S.  Davis 

Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742 


ABSTRACT 

A  knowledge-based  system  for  interpreting  aerial  photo¬ 
graphs,  Picture  Query  (PQ),  first  segments  an  image  into 
primitive, homogeneous regions, then searches among combinations
of these to find instances which satisfy definitions of
object  types.  If  primary  evidence  is  insufficient,  there  may  be 
an  hypothesis-based  search  for  the  supporting  evidence  of 
related  objects.  This  secondary  search  is  restricted  to  windows 
by  expected  spatial  relations.  First  instances  are  improved  by 
searching  for  overlapping  variants  having  better  goodness-of- 
figure.  The  process  may  be  repeated  using  re-estimated  param¬ 
eters  of  object  definitions  based  on  instances  found  previously. 
Results  are  reported  for  images  of  suburban  neighborhoods, 
including  roads,  houses,  and  their  shadows. 

1.  INTRODUCTION 

Analysis  and  interpretation  of  images  are  the  principal 
objectives  of  computer  vision.  This  report  describes  a  system, 
called  Picture  Query  (PQ),  for  two-dimensional  interpretation 
of  images  and  its  application  to  aerial  photos  of  suburban 
neighborhoods. 

There  are  many  image  segmentation  methods  based  on 
intrinsic,  usually  local,  image  properties,  such  as  reflectance, 
texture,  or  adjacency.  Some  use  simple  parameters  or  rules 
derived  from  semantical  or  physical  considerations  about  very 
restricted  scenes,  in  addition  to  heuristics  about  general 
imagery.  However,  this  information  is  inadequate  for  defining 
even  common  objects;  most  of  these  methods  do  not  model 
important  geometric  properties.  Consequently,  such  segmenta¬ 
tion  methods  do  not  reliably  segment  regions  corresponding  to 
important  objects,  and  their  behavior  is  difficult  to  character¬ 
ize. 

Our  system  defines  types  of  and  relations  among  objects, 
and  uses  these  definitions  to  test  for  instances.  It  also  uses  an 
initial,  hierarchical  segmentation  and  a  heuristic  search  to 
incrementally  segment  the  image  into  specified  objects  by 
merging.  Because  of  the  use  of  general  definitions,  incremental 
instantiation,  and  complete  search,  the  results  are  reliable,  and 
the  procedure  is  easy  to  characterize. 


PQ  consists  of  three  subsystems:  a  high-level  vision  sys¬ 
tem  (HLVS)  representing  goals  or  queries  and  knowledge  about 
objects,  a  low-level  vision  system  (LLVS)  for  image  segmenta¬ 
tion  and  evaluation  of  predicates,  and  a  graph  search  pro¬ 
cedure  to  find  instances  by  merging  and  testing. 

The  LLVS  segments  the  image  into  a  one-parameter  fam¬ 
ily  of  segmentations  containing  increasingly  refined  connected 
components.  For  each  object  type,  a  suitable  segmentation  is 
used  as  a  basis  for  a  heuristic  search  to  find  instances,  by 
merging  combinations  of  components  and  testing  their  proper¬ 
ties  and  relations.  The  HLVS  communicates  with  the  user 
about  goals  and  about  definitions  of  objects  by  image  predi¬ 
cates,  maintains  a  database  of  current  interpretations,  and 
provides  segmentation,  search,  and  optimization  procedures 
with  parameters  associated  with  object  types. 

Important  characteristics  of  PQ's  data  representation  are 
description  of  object  types  by  general  image  predicates,  includ¬ 
ing  geometric  ones,  which  are  evaluated  logically,  and 
representation  of  images  by  attributed  adjacency  graphs  asso¬ 
ciated  with  variably  refined  segmentations  into  connected  com¬ 
ponents.  Other  important  characteristics  of  the  control  struc¬ 
ture  are  reinterpretation  with  adjusted  parameters,  optimiza¬ 
tion  of  instances  for  goodness-of-figure,  and  reasoning  with 
expected  spatial  relations.  These  will  be  explained  further  in 
appropriate  chapters. 

2.  HIGH-LEVEL  VISION  SYSTEM 

The  High-Level  Vision  System  (HLVS)  interactively  com¬ 
municates  with  the  user  concerning  queries,  parameters,  and 
definitions  of  objects  to  be  found  in  images.  It  represents 
descriptions  of  types  of  objects,  as  well  as  associated  parame¬ 
ters  which  are  used  during  segmentation,  search,  and  optimiza¬ 
tion.  In  addition,  it  maintains  a  database  of  facts  giving  the 
current  state  of  interpretation.  We  may  consider  this  system 
to  be  the  interpreter  of  the  image  data,  but  especially  of  the 
uninterpreted,  primitive  segmentation  given  by  the  Low-Level 
Vision  System  (LLVS);  this  interpretation  proceeds  by  heuris¬ 
tic  search  for  instances  of  object  types.  The  LLVS  and  the 
search  procedure  will  be  described  in  the  next  two  chapters. 

In  a  complex  scene,  segmentation  by  a  fixed  procedure 
usually  does  not  allow  simple  identification  of  important  parts 
of  an  image.  Some  objects  may  appear  to  be  fragmented  with 
any  choice  of  parameters,  so  that  merging  may  be  necessary; 
also,  some  segmentations  are  more  suitable  for  finding  some 
object  types  than  others.  In  addition,  interpretation  may 
depend  on  more  complicated  reasoning  involving  the  existence 




and  locations  of  secondary  objects.  Finally,  having  identified 
some  instances  of  objects,  it  may  be  necessary  to  adjust  some 
parameter  values  so  that  the  remaining  instances  may  be 
discovered. 

We  now  describe  the  types  of  knowledge  used  by  the 
HLVS  to  control  interpretation,  including  modeling  of  objects 
by  definitions  with  parameters,  reinterpretation  with  adjusted 
parameters, optimization of instances according to goodness-of-figure,
and contingent search using spatial relations.

2.1.  Object  modeling 

The  models  of  objects  are  descriptions  of  their  properties 
and relations as they appear in images. Here we are concerned
with  two-dimensional  interpretation  of  images,  so  that  we  will 
identify  object  types  with  types  of  images,  defining  them  by 
image  predicates,  for  example,  by  grey-value  statistics,  areas 
and  distances,  and  tangent  directions  of  boundary  chains. 

Ordinarily,  an  instance  of  an  object  type  is  a  single  con¬ 
nected  component,  although  it  may  also  consist  of  multiple 
components.  Spatial  relations  are  defined  between  same  or 
different  object  types,  for  example,  between  adjacent  houses  or 
between  houses  and  their  shadows. 

The  description  of  an  object  type  consists  of  properties  of 
and  relations  between  connected  components  or  their  boun¬ 
daries.  In  our  system,  these  include  adjacency  and  relative 
position,  average  grey-value,  area,  compactness,  boundary 
length, and rectilinearity.

Components  are  defined  by  8-connectivity.  Compactness 
is  defined  by  the  ratio  of  area  to  squared  perimeter,  and  is 
used  as  a  goodness-of-figure  measure  for  houses  since  it 
decreases when holes and gaps are present. A component is
said  to  be  rectilinear  if  more  than  half  of  the  tangents  to  its 
boundary  lie  in  some  pair  of  orthogonal  directions. 
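
The two shape measures defined above can be sketched as follows, in Python. Several details are assumptions not given in the text: the perimeter is taken as the number of exposed pixel edges, tangent directions are supplied as angles in degrees, and a tangent counts as lying in a direction if it is within a tolerance of 5 degrees.

```python
def compactness(pixels):
    """Area divided by squared perimeter for a collection of (row, col) pixels."""
    pixels = set(pixels)
    # Assumed perimeter measure: count pixel edges that face a non-member pixel.
    perimeter = sum((r + dr, c + dc) not in pixels
                    for r, c in pixels
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
    return len(pixels) / perimeter ** 2


def is_rectilinear(tangent_angles, tolerance=5.0):
    """True if more than half the tangents lie near some pair of orthogonal directions."""
    def near(a, b):
        d = abs(a - b) % 180.0           # tangents are undirected: work modulo 180
        return min(d, 180.0 - d) <= tolerance

    best = 0
    for theta in range(90):               # candidate orientation of the first axis
        hits = sum(near(a, theta) or near(a, theta + 90) for a in tangent_angles)
        best = max(best, hits)
    return best > len(tangent_angles) / 2
```

With this perimeter measure a solid 3 by 3 square scores 9/144, while the same square with its center removed scores 8/256, which shows how an interior hole lowers the measure.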

The definitions of object types by predicates have the
form of conjunctions of inequalities involving property
functions. For example, house(x) = (min < area(x) < max)
and (min < brightness(x) < max) and (min < compactness(x))
and (min < rectilinearity(x)), where each inequality has its own parameter
bounds. In our present system, we only define road complexes,
that is, connected complexes of streets and driveways
which  are  not  analyzed  further.  Shadows  and  houses  are 
defined  with  the  same  property  functions,  while  road  com¬ 
plexes  are  not  tested  for  area  or  rectilinearity;  different  param¬ 
eters  are  associated  with  different  object  types. 
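
A definition of this form is easy to express as a table of bounds per property. The Python sketch below is illustrative only: `Region`, `make_type`, and all of the numeric bounds are invented, since the parameter values used by PQ are not given here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical measured properties of a candidate region."""
    area: float
    brightness: float
    compactness: float
    rectilinearity: float


def make_type(bounds):
    """bounds maps a property name to (min, max); None leaves that side open."""
    def predicate(region):
        for prop, (low, high) in bounds.items():
            value = getattr(region, prop)
            if low is not None and not low < value:
                return False
            if high is not None and not value < high:
                return False
        return True   # conjunction: every inequality held
    return predicate


# Illustrative numbers only; they are not taken from the paper.
house = make_type({
    "area": (40, 400),
    "brightness": (150, 255),
    "compactness": (0.03, None),
    "rectilinearity": (0.5, None),
})
road_complex = make_type({          # no area or rectilinearity test
    "brightness": (100, 180),
    "compactness": (0.001, None),
})
```

Different object types reuse the same property functions and differ only in which properties are tested and with what bounds.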

Relative  position  is  defined  by  a  window  of  given  size  and 
a  position  specified  relative  to  some  known  object  in  the 
image. It is used to restrict contingent searches for secondary
objects which might provide supporting evidence or whose
existence is otherwise expected. For example, if the evidence
for a house is insufficient, supporting evidence of its shadow
might  be  sought.  Also,  if  a  row  of  houses  has  been  found,  a 
search  might  disclose  intermediate  instances  where  there  are 
large  gaps  in  the  row. 
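
A relative-position window of this kind can be sketched in Python as a rectangle placed at a fixed offset from a known object, with candidates kept only if they fall inside it. The function names, the use of centroids, and the coordinate convention (row, column, with the window given as top, left, bottom, right) are assumptions for illustration.

```python
def search_window(anchor, offset, size):
    """Window of the given (height, width) whose top-left corner is anchor + offset."""
    top, left = anchor[0] + offset[0], anchor[1] + offset[1]
    return top, left, top + size[0], left + size[1]


def candidates_in_window(centroids, window):
    """Names of candidate components whose centroid lies inside the window."""
    top, left, bottom, right = window
    return [name for name, (r, c) in centroids.items()
            if top <= r < bottom and left <= c < right]
```

For a house at (50, 50) with a shadow expected a few pixels down and to the right, only components inside the resulting window need to be tested as shadow candidates.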

2.2.  Parameter  adjustment 

Parameters  of  object  descriptions  are  usually  derived 
from  prior  expert  opinion  about  the  given  domain.  However, 
these  may  not  be  very  useful  in  general  because  of  variations 
due  to  imaging.  For  example,  roofs  may  appear  to  be  brown¬ 
ish  rather  than  silver-white,  due  to  inclination  of  the  sun,  or  to 
regional  differences.  As  a  result,  some  instances  of  an  object 
may  be  found,  while  others  are  not. 


Our  system  attempts  to  confirm  whether  the  parameters 
of  the  model  are  correct,  comparing  the  means  of  properties  of 
found  instances  to  current  model  parameter  values.  If  the 
difference  is  large,  the  image  is  reinterpreted  with  respect  to  an 
adjusted  object  description,  using  the  means  as  new  parameter 
values,  but  with  tighter  bounds  than  at  first.  This  readjust¬ 
ment  of  parameters  may  be  repeated  until  the  results  are 
unchanged. 
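
The re-estimation loop can be sketched for a single property. In this Python illustration a model is a center value and a half-width, instances are the measurements inside that interval, and each round recenters on the mean of the found instances while tightening the interval. The tolerance, the shrink factor of 0.8, and the function names are assumptions, not values from PQ.

```python
from statistics import mean


def find_instances(values, center, half_width):
    """Measurements accepted by the current model bounds."""
    return [v for v in values if abs(v - center) < half_width]


def reinterpret(values, center, half_width, tolerance=1.0, shrink=0.8, max_rounds=10):
    """Re-estimate the model from found instances until the result stops changing."""
    found = find_instances(values, center, half_width)
    for _ in range(max_rounds):
        if not found or abs(mean(found) - center) <= tolerance:
            break                                  # model already agrees with the data
        center, half_width = mean(found), half_width * shrink
        again = find_instances(values, center, half_width)
        if again == found:
            break                                  # results unchanged
        found = again
    return center, half_width, found
```

Starting from a poor prior such as a center of 140, the first pass finds only some roofs; recentering on their mean brings the remaining ones inside the tighter bounds on the next pass.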

2.3.  Optimization  of  instances 

When  an  acceptable  instance  of  an  object  type  is  first 
found  by  merging  primitive  components,  it  may  be  that  there 
are  variant  regions,  overlapping  the  one  first  found,  which  are 
more  acceptable  as  instances.  For  example,  there  may  be  obvi¬ 
ous  gaps  or  irregularities  in  boundaries,  or  interior  holes. 

We  want  to  find  the  variant  of  the  first-found  instance 
which  has  best  goodness-of-figure  for  the  given  type  of  object. 
The  reason  for  this  is  that  the  graph  search  procedure  may 
stop  with  different  results  depending  on  its  starting  point  (seed 
component)  and  its  rules  for  expanding  search.  Because 
definitions must be general, so that tests do not discriminate
against  variable  but  true  instances,  the  search  may  not  stop  at 
the  best  candidate  among  the  set  of  acceptable  variants. 

In  the  case  of  man-made  objects,  such  as  houses  and 
roads,  as  well  as  many  natural  objects,  such  as  trees,  instances 
not  only  have  significant  figure-ground  contrast,  but  usually 
have  regular  or  smooth  boundaries  and  filled  interiors.  Object 
definitions usually require homogeneity of reflectance properties,
guaranteeing figure-ground contrast. It is natural, then, to
use  compactness  as  a  measure  of  goodness-of-figure  to  select 
among  variant  instances. 

The  optimization  procedure  is  simple:  starting  with  the 
first-found  instance,  find  the  variant  differing  by  at  most  one 
primitive  component,  which  is  also  acceptable,  and  having  the 
greatest compactness. This is repeated until no better candidate
is found, with the final best one being returned as the
instance  found  by  the  search  after  optimization.  This  simple 
heuristic  procedure  has  proved  to  be  very  effective  in  most 
cases,  but  it  should  be  possible  to  define  similar  measures  of 
goodness-of-figure  for  different  types  of  objects. 
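
A minimal sketch of this one-component-at-a-time optimization follows. Compactness is taken as area divided by squared perimeter, as defined later in Section 5.4.2. The data layout (components as sets of pixel coordinates, an `adjacency` dictionary, an `acceptable` predicate) and the choice to count perimeter as exposed pixel edges are assumptions made for the example.

```python
def compactness(pixels):
    """Area divided by squared perimeter; perimeter counts exposed pixel edges."""
    perimeter = sum(
        (r + dr, c + dc) not in pixels
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    return len(pixels) / perimeter ** 2


def region_pixels(region, components):
    """Union of the pixel sets of the components in a region."""
    return set().union(*(components[i] for i in region))


def optimize(first_found, components, adjacency, acceptable):
    """Hill-climb from the first-found instance over one-component variants."""
    best = frozenset(first_found)
    best_score = compactness(region_pixels(best, components))
    while True:
        neighbours = {n for i in best for n in adjacency[i]} - best
        variants = [best | {n} for n in neighbours]              # add one component
        variants += [best - {i} for i in best if len(best) > 1]  # or drop one
        improved = False
        for variant in variants:
            if not acceptable(variant):
                continue
            score = compactness(region_pixels(variant, components))
            if score > best_score:
                best, best_score, improved = variant, score, True
        if not improved:
            return best, best_score
```

For a 3x3 ring with a one-pixel hole, filling the hole raises compactness from 8/256 to 9/144, so the filled square is returned.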

2.4.  Reasoning  with  spatial  relations 

Humans  often  use  expected  spatial  relations  among 
objects  in  a  scene  to  find  subtle  or  obscure  instances  whose 
recognition  is  contingent  upon  having  already  found  instances 
of  related  objects.  For  example,  identifying  or  hypothesizing 
elevated  objects,  in  conjunction  with  physical  knowledge, 
makes  it  possible  to  recognize  shadows  or  occluded  objects. 

There are dual aspects of reasoning involving spatial relations:
prediction and confirmation. On the one hand, having
found  an  object,  we  may  expect  to  find  spatially  related 
objects  by  a  restricted  search;  on  the  other  hand,  finding  the 
related  objects  may  tend  to  confirm  our  identification  of  the 
first  object,  which  may  have  been  tentative  or  hypothetical. 
Relations  usually  represent  mutual  information,  even  if  they 
do  not  definitively  identify  either  of  the  related  objects.  Thus 
we  may  use  expected  spatial  relations  to  locate  missing  objects 
or  to  confirm  or  disconfirm  uncertain  hypotheses. 

2.4.1.  Missing  objects 

Sometimes  instances  of  objects,  or  partial  instances,  are 
missed because of their unusual appearance or because they
are obscured. For example, houses or roads may be partly
overshadowed.  Still,  we  may  expect  them  because  of  our 
knowledge  about  regular  configurations  of  houses  along  roads. 

Our system attempts to recover missing houses using
expected relative positions between instances. A less restrictive
definition is applied to any seed component which is in the
correct position relative to a previously found instance but
which did not itself yield an acceptable instance of a house
during search by merging.

2.4.2.  Supporting  evidence 

There  may  be  considerable  variation  of  appearance  in 
instances  of  a  given  object  type,  so  that  it  may  be  necessary  to 
relax  the  parameters  of  the  definitions  in  order  to  find 
instances whose evidence is weak. Furthermore, in these marginal
cases, our system may consider evidence arising from any
types  of  objects  which  are  expected  to  be  spatially  related  to 
an  uncertain  candidate.  For  each  such  type,  a  search  is  made 
in  a  window  positioned  relative  to  the  uncertain  candidate.  If 
some  instance  of  a  related  object  is  found,  then  identification 
of  the  first  candidate  is  confirmed;  otherwise  it  is  rejected. 
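
The confirmation step can be sketched as a search in a window placed relative to the uncertain candidate. The `Window` type, the `(object_type, offset, size)` description of a spatial relation, and the `search` callable are hypothetical names introduced for this illustration; the paper does not specify how relative positions are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    top: int
    left: int
    height: int
    width: int


def relative_window(anchor, offset, size, image_shape):
    """Window of the given size, offset from the anchor's top-left corner
    and clipped to the image."""
    rows, cols = image_shape
    top = min(max(anchor.top + offset[0], 0), rows)
    left = min(max(anchor.left + offset[1], 0), cols)
    bottom = min(max(anchor.top + offset[0] + size[0], 0), rows)
    right = min(max(anchor.left + offset[1] + size[1], 0), cols)
    return Window(top, left, bottom - top, right - left)


def confirm(candidate, related, search, image_shape):
    """Confirm an uncertain candidate if any related object is found nearby.

    related is a list of (object_type, offset, size) triples;
    search(object_type, window) is assumed to return an instance or None.
    """
    for object_type, offset, size in related:
        window = relative_window(candidate, offset, size, image_shape)
        if window.height > 0 and window.width > 0:
            if search(object_type, window) is not None:
                return True
    return False
```

For a house, `related` might hold a single shadow entry whose offset follows the sun direction.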

This  concludes  our  description  of  the  HLVS  which 
represents and controls interpretation of images. We now consider
the LLVS which produces variably refined segmentations
of  the  image,  along  with  attributed  adjacency  graphs,  and 
then  consider  the  search  procedure  which  generates  and  tests 
merged  regions  to  find  instances  of  object  types. 

3.  LOW-LEVEL  VISION  SYSTEM 

The LLVS of PQ is quite simple, consisting of
image enhancement by iteration of an edge-preserving smoothing
operator, computation of a one-parameter family of
increasingly refined segmentations of the enhanced image into
homogeneous connected components, and a library of functions
used in defining object types by predicates and in testing
candidate instances during search.

3.1.  Enhancing  the  image 

The  input  image  is  enhanced  by  iteratively  applying  the 
symmetric-nearest-neighbor  (SNN)  operator,  sharpening  edges 
while flattening interiors of regions [Harwood, 1984]. This
operator is very efficient and converges rapidly without
introducing artifacts; there are also versions for vector images.
About  20  iterations  are  usually  necessary  to  achieve  a  stable, 
enhanced  image,  one  that  is  99.9  percent  fixed. 
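
As an illustration, one common formulation of the SNN operator on a 3x3 window is sketched below: from each of the four pairs of diametrically opposite neighbours, the value closer to the centre pixel is selected, and the centre is replaced by the mean of the four selected values. The window size, tie-breaking, rounding, and border handling here are assumptions and are not taken from [Harwood, 1984]. The 99.9 percent stability test follows the text.

```python
def snn_pass(image):
    """One pass of a 3x3 symmetric-nearest-neighbour filter on a 2-D list."""
    rows, cols = len(image), len(image[0])
    pairs = (((-1, -1), (1, 1)), ((-1, 0), (1, 0)),
             ((-1, 1), (1, -1)), ((0, -1), (0, 1)))
    out = [row[:] for row in image]  # border pixels are left unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            chosen = []
            for (dr1, dc1), (dr2, dc2) in pairs:
                a = image[r + dr1][c + dc1]
                b = image[r + dr2][c + dc2]
                # Keep whichever of the opposite pair is closer to the centre.
                chosen.append(a if abs(a - centre) <= abs(b - centre) else b)
            out[r][c] = round(sum(chosen) / len(chosen))
    return out


def enhance(image, max_iterations=20, fixed_fraction=0.999):
    """Iterate the filter until the given fraction of pixels stops changing."""
    total = len(image) * len(image[0])
    for _ in range(max_iterations):
        result = snn_pass(image)
        unchanged = sum(a == b
                        for old, new in zip(image, result)
                        for a, b in zip(old, new))
        image = result
        if unchanged / total >= fixed_fraction:
            break
    return image
```

A clean step edge is a fixed point of this filter, while an isolated noisy pixel in a flat region is removed in one pass.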

3.2.  Connected  components 

The  enhanced  image  is  segmented  into  homogeneous,  or 
almost-constant,  connected  components  by  a  standard  two- 
pass  routine  for  computing  8-connected  components,  in  which 
adjacent  pixels  are  regarded  as  connected  if  the  difference  of 
their  feature  values,  here  grey-value  alone,  is  sufficiently  small, 
i.e., less than the threshold parameter of the segmentation.
The family of increasingly refined segmentations is parameterized
by this threshold value, with higher parameter values
resulting  in  segmentations  into  fewer,  larger,  higher-contrast 
components,  and  lower  values  resulting  in  many  smaller, 
lower-contrast  components. 
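
A minimal sketch of threshold-based 8-connected labelling follows. It uses a union-find structure in place of the label-equivalence table of the standard two-pass routine, which is a substitution made here for brevity; the function name and the list-of-lists image layout are also assumptions.

```python
def connected_components(image, threshold):
    """Label 8-connected components; neighbours are linked when their
    grey-value difference is less than the threshold."""
    rows, cols = len(image), len(image[0])
    parent = list(range(rows * cols))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # First pass: link each pixel to similar, already-scanned neighbours
    # (west, north-west, north, north-east).
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and abs(image[r][c] - image[rr][cc]) < threshold):
                    a, b = find(r * cols + c), find(rr * cols + cc)
                    if a != b:
                        parent[max(a, b)] = min(a, b)

    # Second pass: replace each root by a compact label in raster order.
    labels, out = {}, [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            root = find(r * cols + c)
            out[r][c] = labels.setdefault(root, len(labels))
    return out
```

Raising the threshold merges more neighbours, giving the fewer, larger components described above.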

3.3.  Attributed  adjacency  graphs 

Associated with each segmentation is an attributed graph
representing the adjacency relation and some properties (area,
mean grey-value) of the primitive components. The family of
segmentations,  together  with  their  graphs,  is  the  basis  for  the 
search  procedure,  whose  goal  is  to  find  instances  of  object 
types  by  merging  components  and  testing  their  properties  and 
relations. 
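
The attributed adjacency graph can be built in one scan of a labelled image, as in the sketch below. The dictionary layout (`area`, `mean`, `neighbours`) is an assumption chosen for illustration, and 8-adjacency is assumed to match the connectivity used for segmentation.

```python
def adjacency_graph(image, labels):
    """Return {label: {"area", "mean", "neighbours"}} for a labelled image."""
    rows, cols = len(image), len(image[0])
    area, total, neighbours = {}, {}, {}
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            area[a] = area.get(a, 0) + 1
            total[a] = total.get(a, 0) + image[r][c]
            neighbours.setdefault(a, set())
            # Look only at forward neighbours so each pixel pair is visited once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    b = labels[rr][cc]
                    if b != a:
                        neighbours[a].add(b)
                        neighbours.setdefault(b, set()).add(a)
    return {a: {"area": area[a],
                "mean": total[a] / area[a],
                "neighbours": neighbours[a]}
            for a in area}
```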

3.4.  Image  predicates 

During  search,  various  functions  of  candidate  regions  or 
of  their  boundaries  are  repeatedly  computed,  including  area, 
grey-value,  perimeter,  tangent  directions  of  boundary,  so  that 
properties  and  relations  of  candidates  may  be  tested  as  to 
whether  they  satisfy  the  definitions  of  object  types. 

4.  SEARCH  FOR  INSTANCES 

The  search  procedure  is  given  a  query,  or  goal,  to  return 
the  first-found  instance  of  an  object  type,  found  by  merging 
and  testing  regions  within  a  specified  window.  The  window 
may  be  the  entire  image,  or  may  be  determined  by  expected 
spatial  relations  among  objects.  All  instances  of  an  object  type 
may  be  found  by  successive  searches  of  the  remainder  of  the 
window. 

4.1. Coarse-to-fine strategy

Search  first  uses  a  coarse  segmentation,  chosen  according 
to  prior  knowledge  of  the  expected  size  and  contrast  of  the 
object  type  in  the  query.  If  necessary,  search  may  continue 
with more expensive computations based on refined segmentations,
until an instance is found, or until the search has failed
at  every  level.  This  coarse-to-fine  strategy,  beginning  with  an 
initial  segmentation  appropriate  to  the  object  type,  is  efficient 
for  finding  instances  when  they  exist. 
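
In outline, the coarse-to-fine control loop might look like the sketch below. The callable `search_at`, which is assumed to run the region-merging search on the segmentation for one threshold and return an instance or `None`, is hypothetical, as is the convention that the caller passes only thresholds at or below the initial one chosen for the object type.

```python
def coarse_to_fine_search(search_at, thresholds):
    """Try segmentations from coarse (high threshold) to fine (low threshold).

    Returns (threshold, instance) for the first level that succeeds,
    or None if the search fails at every level.
    """
    for threshold in sorted(thresholds, reverse=True):
        instance = search_at(threshold)
        if instance is not None:
            return threshold, instance
    return None
```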

4.2.  Seeds 

Search  starts  from  seed  components  having  features 
which  are  typical  of  the  object  type.  Since  these  seeds  are 
often  smaller  than  the  complete  instances,  they  have  to  be 
simply characterized by local features, such as grey-value.
Local feature vectors, based on color or texture, would be
much  more  useful  for  characterizing  seeds.  The  seeds  may  be 
ordered  by  difference  of  their  feature  values  from  the  typical 
values  (or  by  likelihood).  In  our  system,  every  object  instance 
found  by  merging  must  originate  from  some  seed  component 
having  typical  grey-value. 
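
Seed selection and ordering can be sketched as a filter followed by a sort. The function name and the `(area, grey)` tuple layout are assumptions. In the usage note, the bounds and the attributes of component 6 echo the example of Section 5.3, but the other component values and the typical grey-value of 155 are invented for illustration.

```python
def select_seeds(components, grey_bounds, max_area, typical_grey):
    """Pick components with typical grey-value and bounded area, ordered by
    how close their grey-value is to the typical value.

    components maps a component id to (area, mean_grey).
    """
    low, high = grey_bounds
    seeds = [cid for cid, (area, grey) in components.items()
             if low < grey < high and area < max_area]
    return sorted(seeds, key=lambda cid: abs(components[cid][1] - typical_grey))
```

With `components = {0: (900, 70), 1: (40, 120), 4: (60, 140), 5: (30, 145), 6: (94, 152)}`, bounds `(110, 255)`, maximum area 310 and typical grey-value 155, the seeds come out in the order 6, 5, 4, 1.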

4.3.  Merging  regions 

Our  method  of  search  by  merging  primitive  regions  is  a 
modified  graph  search,  operating  on  the  adjacency  graph  of  a 
segmentation;  it  will  be  described  in  the  following  subsections. 

4.3.1.  Motivation 

It is impossible to define general procedures for segmenting
images of natural objects, except when their properties are
very  uniform,  and  when  object-background  contrast  is  large. 
Simple  procedures  like  recursive  thresholding  or  split  and 
merge  algorithms  are  fairly  efficient  and  give  apparently  good 
segmentations  of  some  images.  But  the  problem  with  these  is 
not  that  they  generally  have  wrong  parameters,  but  that  they 
do  not  really  search  for  identifiable  objects  or  reason  about 
their  relations. 

There are a number of difficulties in implementing search
procedures for interpretation, including adequate description of
objects, evaluation of evidence, efficiency and completeness,
and effective heuristics for control. But these difficulties
seem to be worth overcoming if the resulting system is generally
applicable and achieves robust and sound results, at the
cost of time occasionally spent in exhaustive, but futile
searches. (There may be good heuristics to cut these off.)

4.3.2. Basic concepts

The search algorithm is bounded-depth, best-first search
based on merging and testing of candidate regions, using a
heuristic merit function to direct the search. The terms used
to describe the search are briefly explained in the following
paragraphs.

The adjacency graph consists of a set of nodes representing
components, together with edges representing their adjacencies.

Another graph, the search graph, is generated during the
search process, representing information about merged regions.
A node of this graph represents a merged region and gives its
merit value, its size and average grey-value; the merged region
is identified by the path of components which were merged,
starting with the seed region as root, resulting in the current
region.

The successors of a node represent results of merging
new adjacent components with the region represented by the
node.

The open list is a priority queue of components which are
adjacent to the current merged region and have not yet been
considered for expansion. Their order of merit is based on the
differences between their average grey-values and the expected
value of the object.

The closed list consists of those components which have
already been eliminated.

For more details about graph search, see [Rich, 1983].

4.3.3. Algorithm for modified graph search

(1) Create an initial search graph G, consisting solely of
    the start node S, representing the best seed in
    the list of seeds. Create empty open and closed
    lists, then put S on the open list.

(2) Loop: if the open list is empty, quit the search with
    failure to find an instance of the object type.

(3) Put the first open node on the closed list, calling it
    N.

(4) Expand node N into successors which satisfy acceptance
    criteria, as detailed in Section 4.3.4.

(5) For each successor in turn, check whether stopping
    criteria (detailed in Section 4.3.5) are satisfied. If
    so, quit the search successfully, returning the
    resulting instance.

(6) Reorder the open list according to merit.

(7) Repeat Loop.

The search procedure is then restarted with a new seed
until the search succeeds or the seed list is exhausted.

4.3.4. Acceptance criteria

Let M be the region represented by the node which is
currently to be expanded. Let A be a component to be tested
for acceptability, and R be the result of merging M and A.
Then A is acceptable if it satisfies all the following conditions:

(1) A and M are connected.

(2) R is different from every closed node; this avoids redundant
    search and calculation, at the lesser cost of tabulation
    and checking.

(3) The area of R is less than the maximum area of the object,
    and its average brightness is within bounds.

(4) One of the following rules for merging, based on similarity
    and size, must be satisfied:

    Rule (i): Regardless of their sizes, merge A and M if
    their feature differences are small enough.

    Rule (ii): Otherwise, if A is large enough, merge A and
    M if their feature differences are small
    enough under a relaxed constraint. This
    rule allows merging of larger, somewhat dissimilar
    regions, as long as condition (3) warrants
    their merging. This may apply to surfaces
    with varying orientation, e.g., roofs of
    houses.

    Rule (iii): If the area of A is very small, merge it with
    M. A may be an artifact or noise.

Generally, other rules of limited applicability may be
needed to handle special situations. For other merging rules,
see [Nazif, 1984].

Preliminary tests of model properties are contained in
condition (3). As the graph search proceeds, merged regions
become too large to be expanded. The other conditions restrict
merging so that it is non-redundant and preserves homogeneity
and connectivity.


4.3.5.  Stopping  criteria 

The  search  stops  with  failure  when  it  exceeds  the  bound 
on  depth  without  finding  an  acceptable  instance. 

On  the  other  hand,  when  merging  results  in  finding  a 
candidate region which is an instance of the object type, meeting
all the criteria of its definition, the search might stop
successfully, returning the instance. We will call this version FF
search,  and  the  returned  instance  will  be  called  the  first-found 
instance. 

The  problem  with  this  procedure  is  that  the  first-found 
instance may have gaps at its boundaries or holes in its interiors,
so that even though it passes other tests, it is not a very
good  instance.  Moreover,  there  may  be  overlapping  variants  of 
the  first-found  instance  which  are  better  instances.  Generally, 
the  first-found  instance  will  be  too  small,  being  incomplete. 

It  would  be  very  expensive  to  search  the  entire  space  of 
acceptable  candidates,  trying  to  find  one  which  is  the  best 
instance.  A  different  approach  is  to  expand  according  to  the 
merit  function  until  some  instance  is  found  which  cannot  be 
further  acceptably  expanded.  This  version  will  be  called  FL 
search,  and  the  returned  instance  will  be  called  the  first-large 
instance. A comparison of the two approaches will be
presented in Chapter 5.
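
The search of Sections 4.3.3 to 4.3.5 can be sketched as below, with FF and FL selected by a `mode` argument. This is an illustrative reading, not the authors' code. Assumptions: nodes are frozensets of component ids, the graph uses the `area`/`mean`/`neighbours` layout, the open list holds merged regions ordered by a caller-supplied `merit`, a path is abandoned rather than the whole search when the depth bound would be exceeded, and the thresholds passed to `make_accept` (hypothetical names `strict`, `relaxed`, `large`, `tiny`) stand in for rules (i) to (iii).

```python
import heapq
from itertools import count


def region_stats(region, graph):
    """Total area and area-weighted mean grey-value of a set of components."""
    area = sum(graph[c]["area"] for c in region)
    mean = sum(graph[c]["area"] * graph[c]["mean"] for c in region) / area
    return area, mean


def make_accept(graph, max_area, grey_bounds, strict, relaxed, large, tiny):
    """Build an acceptance test for conditions (3) and (4) of Section 4.3.4."""
    def accept(region, c):
        _, mean_m = region_stats(region, graph)
        area_r, mean_r = region_stats(region | {c}, graph)
        if not (area_r < max_area and grey_bounds[0] <= mean_r <= grey_bounds[1]):
            return False                                           # condition (3)
        diff = abs(graph[c]["mean"] - mean_m)
        return (diff <= strict                                     # rule (i)
                or (graph[c]["area"] >= large and diff <= relaxed)  # rule (ii)
                or graph[c]["area"] <= tiny)                       # rule (iii)
    return accept


def graph_search(seed, graph, accept, is_instance, merit, mode="FF", max_depth=10):
    """Best-first search over merged regions grown from one seed component."""
    tie = count()  # tie-breaker so the heap never compares regions
    start = frozenset([seed])
    open_list = [(-merit(start), next(tie), start)]
    closed = set()
    while open_list:
        _, _, region = heapq.heappop(open_list)
        if region in closed:
            continue
        closed.add(region)
        # Condition (1): only components adjacent to the region are candidates.
        frontier = {n for c in region for n in graph[c]["neighbours"]} - region
        # Condition (2): skip merged regions that are already closed.
        successors = [region | {c} for c in sorted(frontier)
                      if (region | {c}) not in closed and accept(region, c)]
        if successors and len(region) > max_depth:
            continue  # expanding would exceed the depth bound
        for successor in successors:
            if mode == "FF" and is_instance(successor):
                return successor  # first-found instance
            heapq.heappush(open_list, (-merit(successor), next(tie), successor))
        if mode == "FL" and not successors and is_instance(region):
            return region  # first-large instance: cannot be expanded further
    return None
```

A caller would restart `graph_search` with the next seed on failure, and remove the components of a returned instance from further search, as the text describes.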

5.  EXAMPLES 

This  chapter  presents  some  representative  examples  of 
the  application  of  PQ  to  the  analysis  and  interpretation  of 
aerial  photos  of  suburban  neighborhoods.  The  search  for 
instances  of  roads,  houses,  and  shadows  of  houses  is  described 
in  detail. 

5.1.  Goal 

The  goal  of  PQ  is  to  find  connected  regions,  obtained  by 
merging,  which  are  instances  of  object  types.  In  the  following 
examples, we consider houses and their shadows, as well as
roads, but other types of objects might also be considered by
adding  their  definitions.  The  system  may  return  from  search 
with  all  disjoint  instances,  or  with  the  first-found,  possibly 
optimized  instance  of  the  requested  type,  if  any. 

The  system  asks  the  user  to  specify  what  type  of  object  is 
to  be  found  in  what  window  of  the  image — for  example,  houses 
in  the  entire  image.  There  may  be  several  such  goals  on  a 
stack,  which  are  considered  in  order. 

Given  a  current  goal,  PQ  first  checks  its  database  to  find 
whether  some  instance(s)  have  already  been  found  which 
satisfy  the  goal.  If  not,  PQ  searches  for  instances  as  specified 
by  the  goal. 

5.2.  Enhancement  and  initial  segmentation 

Iterative  application  of  the  SNN  operator  produces  an 
enhanced  image  having  flat  regions  with  fixed,  sharpened 
edges. Figure 1a shows the original 440x160 image of a suburban
neighborhood, and Figure 1b shows the enhanced image.

The  initial  segmentation  of  the  enhanced  image  is 
obtained by a connected components algorithm, in which adjacent
pixels are connected if their grey-value difference is less
than  a  threshold  parameter  value.  Finer  segmentations  are 
associated  with  lower  threshold  values,  and  values  are  chosen 
for  searches  according  to  the  expected  sizes  and  contrasts  of 
object  types. 

Figure  2  shows  a  44x40  enhanced  image  of  a  house  and 
yard  and  its  segmentation  into  connected  components. 

In  general,  the  number  of  components  depends  on  the 
threshold  value,  and  also  on  image  size  and  contrast.  Figure  3 
shows  high-  and  low-contrast  segmentations  of  an  image  of  a 
neighborhood using two thresholds; the image with lower contrast
has fewer and larger components.

5.3.  Search  for  instances 

Given  a  query  about  a  type  of  object,  PQ  searches  for 
instances  by  merging  and  testing  regions.  The  search  originates 
at  seed  components  having  typical  grey-values  and  continues 
until  a  first-found  or  first-large  instance  is  found,  or  until  all 
candidates  fail  or  search  exceeds  the  depth  bound. 

Figure 4 illustrates how seed components for houses are
selected from the components of a house and yard, also
shown in Figure 2. Here seed components 1, 4, 5, and 6 were
selected by constraints on their grey-value (110 < g < 255) and
area (< 310), and then ordered 6, 5, 4, 1 according to the
difference between their grey-value and the typical value.

5.3.1.  Version  FF  graph  search 

In  our  example,  the  search  graph  initially  consists  of  a 
single  node  representing  seed  6.  Its  area  and  average  grey- 
value  are  94  and  152,  as  shown  in  the  table  in  Figure  4c. 

In the search graphs shown in Figures 5 and 6, nodes that
are  closed,  having  been  considered  already,  are  dark  circles, 
while  nodes  that  are  still  open  are  white  circles.  A  new  cycle  of 
merging  and  testing  begins  whenever  an  open  node  is  closed. 

In the first cycle, region-merging starts with seed component
6. The table in Figure 4 shows that components 0, 4,
and 5 are adjacent to component 6, so each of these in turn
will be checked. Component 0 fails to satisfy condition (4) on
area and grey-value, but component 4 is acceptable by all criteria,
so it is chosen to be merged in the successor of node (6),
which  is  checked  against  the  stopping  criteria.  The  search  then 


successfully  terminates,  since  the  merged  region  consisting  of 
components 6 and 4 satisfies the definition of a house. This
region is entered in the database as an instance of a house,
while  its  individual  components  are  removed  from  further 
search. 

Figure  5  shows  the  search  graph  of  the  last  two  cycles, 
and  Figure  5c  shows  the  instance  found.  This  instance  has  a 
hole caused by early termination of the graph search; component
5, which fills the hole, is ignored. The remaining seeds,
1 and 5, fail to grow into acceptable house candidates since
components 6 and 4 have been removed from further search.

5.3.2.  Version  FL  graph  search 

The  Version  FF  search  terminates  with  a  first-found 
instance, even though there may be holes or gaps in the resulting
figure, as in the above example. Version FL search, on the
other  hand,  terminates  with  a  first-large  instance  which  cannot 
be  further  expanded. 

The same image may be used to compare the two versions.
As before, seed 6 is selected first, and its adjacent components
are 0, 4, and 5. While 0 remains unacceptable, 4 and 5
are  acceptable  for  expanding  node  (6).  In  the  second  cycle,  the 
node  (6,5),  having  greater  merit,  is  chosen  for  expansion;  it  has 
unused  adjacent  components  0  and  4.  As  before,  component  0 
fails  condition  (4).  Component  4  is  acceptable,  and  so  is 
merged  in  the  successor  in  the  path.  In  the  third  cycle,  the 
current  node  (6,5,4)  has  adjacent  components  0  and  1,  of 
which 1 is acceptable. Finally, the node (6,5,4,1) cannot be
acceptably expanded.

This merged region (the union of components 6, 5, 4, and
1) is now tested and satisfies the definition of a house, so is
returned  as  the  first-large  instance.  If  a  candidate  does  not 
satisfy  the  definition,  the  search  continues  to  expand  open 
nodes  to  find  a  first-large  instance.  Figure  6  shows  the  FL 
graph  search  for  this  example,  and  the  returned  first-large 
instance. 

5.4.  Reasoning  by  the  HLVS 

The  HLVS  uses  its  database  of  accumulated  facts  about 
found  instances  to  control  parameter  adjustment,  optimization 
of  instances,  and  contingent  search  using  spatial  relations. 

5.4.1.  Adjusting  model  parameters 

The  HLV  system  compares  mean  values  of  properties  of 
found  instances  with  current  expected  values.  When  the 
differences  are  large,  the  current  values  are  replaced  by  the 
sample  means,  and  the  image  is  reinterpreted  with  respect  to 
the  new  expected  values.  This  procedure  may  be  repeated 
until  the  results  are  nearly  unchanged. 

To illustrate this, Figure 7 shows the result of interpreting
a 480x410 image before and after adjusting the area and
grey-value parameters used to define houses. After one step,
the segmentation is obviously improved, and after three iterations,
the parameter values are nearly unchanged. The same
number  (47)  of  true  instances  have  been  found,  but  seven  of 
ten  false  instances  have  been  eliminated. 

5.4.2.  Optimizing  instances 

Figure 5 shows a hole in a first-found instance of a
house. Its compactness, the ratio of area and squared perimeter
(a measure of goodness-of-figure), is 0.033. Among all variants
which differ from this first instance by at most one component,
the candidate consisting of components 4, 5, and 6 is
also an acceptable instance, but with an improved value of
compactness, 0.067.

Iteration  of  this  procedure  continues  until  compactness 
can  no  longer  be  improved.  In  the  previous  example,  a  second 
iteration  does  not  find  a  more  compact  instance,  so  that  the 
instance  consisting  of  components  4,5,  and  6  is  considered 
optimal.  Figure  8  shows  the  result  of  optimization  for  this 
example, and also for another more complex example. Incremental
optimization of instances with respect to compactness
is  usually  effective,  although  it  may  be  possible  to  achieve  good 
results  for  other  types  of  objects  with  different  measures  of 
goodness-of-figure,  perhaps  based  on  contrast  at  boundaries. 
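
For reference, the compactness measure used here can be written out with the values it takes for two ideal shapes. The symbols A (area), P (perimeter), r (radius) and s (side length) are introduced for this note only. By the isoperimetric inequality no plane figure exceeds the circle's value; values measured on a pixel grid also depend on how the perimeter is counted, which the text does not specify.

```latex
C = \frac{A}{P^{2}}, \qquad
C_{\text{circle}} = \frac{\pi r^{2}}{(2\pi r)^{2}} = \frac{1}{4\pi} \approx 0.0796, \qquad
C_{\text{square}} = \frac{s^{2}}{(4s)^{2}} = \frac{1}{16} = 0.0625
```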

5.4.3.  Contingent  search 

Contingent  search  for  spatially-related  types  of  objects, 
based  on  identifying  or  hypothesizing  a  given  instance,  may  be 
used  to  find  missing  or  predicted  objects,  or  to  confirm  or 
reject  tentative  hypotheses  about  the  first  object. 

5.4.3.1. Supporting evidence

The  system  may  find  additional  instances  of  object  types 
within  low-contrast  or  complicated  images  by  relaxing  object 
definitions. Given such an uncertain object hypothesis, the system
may search for instances of related objects that would
tend  to  support  or  disconfirm  the  hypothesis. 

Figure  9  shows  six  candidate  houses  in  an  image,  of 
which  the  two  in  the  upper  right  are  uncertain  instances,  that 
are  hypothesized  only  when  the  parameters  of  the  definition 
have been relaxed. To confirm these instances, the HLVS initiates
searches within restricted windows for the shadows of
the  two  candidate  houses.  The  result  of  this  search  is  shown  in 
Figure  9c. 

5.4.3.2.  Finding  missing  objects 

Sometimes  additional  instances  of  object  types  are 
expected  on  the  basis  of  previous  identification  of  related 
instances.  For  example,  Figure  10  shows  that  the  fourth  house 
in  the  top  row  is  overshadowed  so  that  it  is  not  found;  also, 
partial  instances  in  the  margin  of  the  image  are  missed.  Since 
neighboring  instances  have  been  identified,  the  system  initiates 
searches  within  windows  determined  by  the  neighboring 
instances,  beginning  with  seed  components  which  previously 
failed  to  grow  into  instances.  Because  of  the  expected  spatial 
relations,  the  system  provisionally  relaxes  the  definition  of  a 
house  to  admit,  as  an  instance  of  part  of  a  house,  any  seed  or 
acceptable  expansion  within  the  search  window.  When  this  is 
done,  Figure  10c  shows  that  three  missing  small  parts  of 
houses  are  found. 

5.5.  Other  experimental  results 

Figures  11-14  show  images  of  four  different  suburban 
neighborhoods,  their  initial  segmentations,  and  instances  of 
houses  and  roads  obtained  by  merging  and  testing.  All  these 
experiments involve FL search, reinterpretation with parameter
adjustment, and use of contingent searches for missing
houses.  Some  illustrate  optimization  as  well. 

Merging to find roads is difficult, requiring special treatment,
since they are elongated, adjacent to many components
along a road, and sometimes have poor figure-ground contrast.
Given these difficulties, only when primitive regions are
sufficiently large are they acceptable for merging as road
complexes.


Figure  11  shows  the  instances  of  houses  found  by  FL 
search,  without  optimization,  starting  with  a  coarse,  high- 
contrast  segmentation  into  components.  It  also  shows  the 
instances of a road complex found in a finer, lower-contrast
segmentation. 

Figure  12  gives  similar  results  for  a  more  complicated 
image.  The  results  are  still  very  good,  correctly  identifying  all 
but  perhaps  two  of  fifty-one  instances  of  houses,  including  two 
difficult  overshadowed  instances,  with  only  two  false  instances. 
All  important  roads  were  also  found  in  this  image,  which  has 
good  contrast  but  lower  resolution. 

Figures  13  and  14  show  optimization  of  the  FL  instances 
of  houses.  The  optimized  instances  generally  have  better 
shapes,  with  fewer  gaps  and  protruding  components. 

6.  RELATED  WORK 

The  goal  of  the  PQ  system  is  to  find  regions  in  an  image 
which  correspond  to  certain  objects  in  a  scene.  Developing 
such  a  model-guided  segmentation  system  involves  addressing 
several  issues. 

The first issue is image representation. [Marr, 1982] suggested
that an edge-based primal sketch would capture most
of  the  salient  information  about  an  image.  In  a  similar  spirit, 
Feldman  and  Yakimovsky  [Feldman,  1974]  segmented  images 
into  primitive  regions  of  almost  constant  brightness.  Since 
contrast  between  pixels  is  used  to  determine  these  regions,  this 
representation  preserves  much  of  the  edge  information  of 
Marr’s  primal  sketch. 

The second issue is how to apply knowledge in segmentation.
Some image understanding (IU) systems restrict the
knowledge to the high-level portion of the system, using it to
check the results from the low-level portion. Such systems
often use simple segmentation methods such as thresholding;
see [Selfridge, 1982] and [Hwang, 1985]. Other IU systems pass
knowledge in the form of parameters or optimization criteria
to their low-level components, which then use segmentation
methods involving search; see [Feldman, 1974] and [McKeown,
1984].

In  the  following  subsections,  five  representative  systems 
related  to  PQ  are  discussed. 

6.1.  Feldman  and  Yakimovsky 

In [Feldman, 1974], a combination of mathematical decision
theory and heuristic search techniques is used to develop
a  general-purpose  system  for  scene  analysis. 

A  preliminary  segmentation  procedure  partitions  the 
image  into  primitive  regions.  Each  primitive  region  is  initially 
assigned  various  probabilities  of  interpretation,  based  on  the 
values  of  measurements  (e.g.  color,  area),  using  knowledge 
about  possible  objects  in  the  scene. 

After  this  preliminary  evaluation,  the  system  uses  a 
heuristic  search  to  merge  primitive  regions,  as  does  PQ.  The 
goals  of  the  two  systems  are  different.  The  goal  of  the  former 
system  is  to  find  the  best  global  interpretation,  the  one  having 
the  greatest  likelihood,  while  PQ’s  goal  is  partial  interpretation 
by  extracting  regions  that  satisfy  object  descriptions.  The 
merit  function,  acceptance  criteria,  and  stopping  criteria  used 
in  the  search  are  very  different. 

6.2.  Barrow  and  Tenenbaum 

In IGS [Tenenbaum, 1977], knowledge from a variety of
sources is used to make inferences about the interpretations of
regions. A scene is first partitioned into elementary regions
consisting of individual pixels or groups of adjacent pixels with
identical attributes. Beginning with this partition, IGS first
performs the most complete interpretation possible in the
current partition. Based on this interpretation, it next merges
a pair of adjacent regions that are least likely to represent distinct
objects. Merging is repeated until all adjacent regions
have different interpretations.

In  a  later,  different  approach,  called  MSYS  [Barrow, 
1976],  regions  are  assigned  sets  of  possible  interpretations  with 
associated  likelihoods  based  on  local  attributes  (e.g.,  color,  size, 
and  shape).  A  relaxation  process  is  then  applied  repeatedly  to 
adjust  the  likelihoods  up  or  down  based  on  the  interpretation 
likelihoods of related regions, until a consistent set of likelihood
values is attained. At this stage, several alternative
interpretations may still exist for some regions. The final
stage  of  analysis  involves  searching  for  a  set  of  unique 
interpretations  with  the  highest  joint  likelihood. 

6.3.  McKeown and Denlinger

MACHINESEG  [McKeown,  1984]  uses  map  knowledge  to 
guide image segmentation. The map-to-image correspondence
can  be  derived  from  camera  and  terrain  models.  The  location 
of  each  map  feature  in  the  map  database  can  then  be  pro¬ 
jected  onto  the  image.  Map  knowledge  is  used  to  determine 
criteria  for  region-growing. 

The  merge  algorithm  is  straightforward.  Initially,  con¬ 
nected  components  of  homogeneous  image  intensity  are  pro¬ 
duced.  A  list  of  the  edges  between  regions  is  sorted  by  the 
strengths  of  the  edges.  Criteria  for  testing  merges  are  derived 
from  map-based  descriptions.  Regions  separated  by  weak 
edges  are  merged  first.  Each  time  a  new  region  is  created,  it  is 
checked  against  the  criteria  to  see  if  it  can  be  identified  as  a 
prototype region. The processing is repeated until some
merged region fits the criteria, or until the process exceeds a
given depth.
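
The merge loop described above can be sketched as follows. This is a minimal illustration, not the MACHINESEG implementation: the `Region` record, the `accept` predicate (standing in for the map-derived criteria), and the `max_merges` depth limit are all hypothetical names introduced here, and a union-find structure is assumed for tracking which primitive regions have already been merged.

```python
from dataclasses import dataclass


@dataclass
class Region:
    pixels: int    # area of the region in pixels
    total: float   # sum of grey values over the region

    @property
    def mean(self) -> float:
        return self.total / self.pixels


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_weakest_first(regions, edges, accept, max_merges):
    """regions: {id: Region}; edges: [(strength, id_a, id_b)].

    Merge across the weakest edges first. Return (id, Region) for the
    first merged region that satisfies accept(), or None if no merged
    region qualifies within max_merges merges (the assumed depth limit).
    """
    parent = {i: i for i in regions}
    current = dict(regions)          # work on a copy; inputs stay untouched
    merges = 0
    for _strength, a, b in sorted(edges, key=lambda e: e[0]):
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue                 # already part of the same merged region
        if merges >= max_merges:
            break
        merged = Region(current[ra].pixels + current[rb].pixels,
                        current[ra].total + current[rb].total)
        parent[rb] = ra
        current[ra] = merged
        del current[rb]
        merges += 1
        if accept(merged):           # test the new region against the criteria
            return ra, merged
    return None
```

Because the merge order depends only on edge strength, the acceptance test is the sole place where object knowledge enters this sketch.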

In contrast with MACHINESEG, which starts from fixed
primitive regions, PQ uses variably fine segmentations.
In MACHINESEG, merging is controlled by
edge strength only. In PQ, a model-based merit function is
employed to determine which region should be merged next.
This is more flexible than a function based on intrinsic image
features without consideration of object types.

6.4.  Selfridge 

[Selfridge,  1982]  describes  a  three-level  system  (model 
expert,  segmentation  expert,  and  parameter  expert)  to  locate 
houses  and  roads  in  aerial  photographs.  The  model  expert 
represents objects and their expected locations. The segmentation
expert selects a segmentation procedure for a given situation.
The parameter expert searches for the best parameters
of  a  thresholding  algorithm  to  extract  a  region  matching  the 
expected  appearance. 

Performance evaluation of success and failure within each
level is integrated into the analysis. If one rule fails, another is
tried. The program terminates with failure only when all rules
have been exhausted.

6.5.  Hwang 

SIGMA [Hwang, 1985] employs an approach similar to that of
Selfridge, but emphasizes evidence accumulation. Blobs and
ribbons which may be instances of houses and roads are initially
extracted from the image using simple segmentation
techniques. These instances are used to predict the locations
of others. Computational resources are focused on those parts
of the image that have a high likelihood of containing additional
objects of interest. The likelihood is based on the intersections
of windows (regions of interest) associated with the hypotheses
generated by the system.

Unlike SIGMA, PQ does not assume that there is a direct
correspondence between the regions computed by segmentation
and object models. PQ emphasizes the role of model-directed
search for interpreting the results of image segmentation.

7.  SUMMARY 

Effective  image  segmentation  is  crucial  to  any  image 
understanding  system.  Ideally,  the  segmentation  should  be 
sufficiently  good  to  reliably  determine  the  correspondence 
between  parts  of  the  images  and  parts  of  the  object  types  of 
interest. 

The  approach  of  this  investigation  has  been  to  initially 
segment  an  enhanced  image  into  small,  homogeneous  regions, 
then  to  search  among  these  for  combinations  that  can  be  inter¬ 
preted  as  object  instances.  Our  system,  called  PQ,  has  been 
applied  to  find  instances  of  roads,  houses,  and  their  shadows  in 
aerial  photographs  of  suburban  neighborhoods. 

The  three  subsystems  of  PQ  are  a  high-level  vision  sys¬ 
tem  (HLVS),  representing  goals  and  knowledge  about  objects, 
a  low-level  vision  system  (LLVS),  for  segmentation  and  evalua¬ 
tion  of  image  predicates,  and  a  graph  search  procedure  to  find 
instances  by  merging  and  testing. 

The  HLVS  consists  of  descriptions  of  object  types,  a 
database  of  facts  about  the  current  image  interpretation,  and 
parameters  of  the  initial  segmentation  and  search  for  given 
object  types. 

The  initial,  uninterpreted  segmentation  of  the  enhanced 
image  by  the  LLVS  is  represented  by  an  attributed  graph  of 
connected  regions,  giving  the  adjacency  relations  and  proper¬ 
ties  of  these  regions.  This  reduces  the  complexity  of  the  image 
data,  while  preserving  important  details  of  contrast  and 
geometry.  A  family  of  coarse-to-fine  segmentations,  obtained 
from  the  connected  components  algorithm  by  parameter 
adjustment,  may  be  used  in  different  searches. 

The  initial  segmentation  provides  a  basis  for  a  search  for 
object  instances  by  merging  components.  This  search  pro¬ 
cedure  employs  heuristics,  including  a  merit  function  for 
expansion  based  on  expected  object  grey  value,  as  well  as 
acceptance,  optimization,  and  stopping  criteria,  derived  from 
prior  knowledge  of  object  types. 

The  efficiency  of  the  search  procedure  is  improved  by 
selecting  a  suitable  initial  segmentation,  and  by  restricting 
search  based  on  known  spatial  relations  among  objects. 

It might be interesting to consider methods of evaluating
evidence having intermediate truth values. A three-valued
logic, with true, false, and indeterminate values, might
be psychologically plausible, and useful for gradual interpretation
of images. In addition, it might be useful to implement
a more general search procedure employing window functions,
recursive definitions, and finite quantification. The latter
would make it possible to define searches for specific numbers
of instances, for example, for exactly four corners of a house.
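
As an illustration of how three-valued evidence and finite quantification might interact, the following sketch uses Kleene's strong three-valued logic, with Python's `None` standing for the indeterminate value. The choice of Kleene's connectives and the helper `exactly_n` are assumptions made here for illustration; the text above does not commit to a particular logic.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then indeterminate (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: True dominates, then indeterminate (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: indeterminate stays indeterminate."""
    return None if a is None else (not a)


def exactly_n(values, n):
    """Finite quantifier: do exactly n of the values hold?

    Returns False as soon as the count can no longer equal n, True when
    every value is decided and the count is n, and None otherwise.
    """
    true_count = sum(1 for v in values if v is True)
    unknown = sum(1 for v in values if v is None)
    if true_count > n or true_count + unknown < n:
        return False
    if unknown == 0:
        return True
    return None
```

For the house example, `exactly_n(corner_tests, 4)` would stay indeterminate while some corner tests are undecided, which is one way a gradual interpretation could defer commitment.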

The present system correctly and efficiently identified
most instances of houses and roads in low-contrast aerial
photographs.

REFERENCES 

[Barrow, 1976] Barrow, H. G. and Tenenbaum, J. M., MSYS:
a system for reasoning about scenes, Technical Note
121, Artificial Intelligence Center, Stanford Research
Institute (1976).

[Feldman, 1974] Feldman, J. A. and Yakimovsky, Y., Decision
theory and artificial intelligence: I. A semantics-based
region analyzer, Artificial Intelligence 5, 349-371
(1974).

[Harwood, 1984] Harwood, D., Subbarao, M., Hakalahti, H.,
and Davis, L. S., A new class of edge-preserving
smoothing filters, TR-1397, Center for Automation
Research, University of Maryland, College Park, MD
(1984).

[Hwang, 1985] Hwang, V. S. S., Davis, L. S., and Matsuyama,
T., Hypothesis integration in image understanding
systems, TR-1513, Center for Automation Research,
University of Maryland, College Park, MD (1985).

[Marr, 1982] Marr, D., Vision, W. H. Freeman Co. (1982).

[McKeown, 1984] McKeown, D. M. and Denlinger, J. L.,
Map-guided feature extraction from aerial imagery,
Workshop on Computer Vision: Representation and
Control, Annapolis, MD, 205-213 (1984).

[Nazif, 1984] Nazif, A. M., and Levine, M. D., Low level image
segmentation: an expert system, IEEE Transactions
on Pattern Analysis and Machine Intelligence 6, 555-577
(1984).

[Rich, 1983] Rich, E., Artificial Intelligence, McGraw-Hill
(1983).

[Rosenfeld, 1982] Rosenfeld, A. and Kak, A. C., Digital Picture
Processing, Academic Press (1982).

[Rosenfeld, 1984] Rosenfeld, A., Image analysis: problems,
progress and prospects, Pattern Recognition 17, 3-12
(1984).

[Selfridge, 1982] Selfridge, P. G., Reasoning about success and
failure in aerial image understanding, PhD Thesis,
University of Rochester (1982).

[Tenenbaum, 1977] Tenenbaum, J. M. and Barrow, H. G.,
Experiments in interpretation-guided segmentation,
Artificial Intelligence 8, 241-274 (1977).




(a)  Original  image 

(b)  Enhanced  image 

Figure  1:  Original  and  enhanced  image  of  two  rows  of 
houses 


00000000000000000000000001111100000000000000
00000000000000000000200001111000000000000000 
00000000000330000002222001111000000000000000 
00000000003333000002222211111100000000000000 
00000000033333000002222221111110000000000000 
00000000033333000000222221111111100000000000
00000000033333000000022201111111100000000000 
00000000003300000000000000111111100000000000
00000000000000000000000000111111000000000000
00000000000000000000000001111111000000000000 
00000000000000000000000001111111100000000000 
00000000000044444444000444441111100000000000 
00000000000444444444444444444411100000000000 
00000000004444444444444444444444000000000000 
00000000004444444444444444444444000000000000 
00000000004444444444444444444444400000000000 
00000000000444444444444444444444000000000000 
00000000000444444445554444444444000000000000 
00000000000608600355555555544444000000000000 
00000000000066060665555555506400000000000000 
00000000000660006066065556600608000000000000 
00000000000600606066600606606660000000000000 
00000000000666666066666606666600000000000000 
00000000000666666666666666666000000000000000 
00000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000000000000000000 
00000000000000000000000000000077700000000000 
00000000000000000000000000000777770000000000 
00000000000000000000000000007777770000000000 
00000000000000000000000000000777770000000000 
00000000000000000000000000007777770000000000 
00000000000000000000000000000777777000000000 
00000000000000000000000000000077770000000000 


(c)  Printout  of  component  labels 


Figure  2:  Initial  segmentation  of  a  house  and  yard 




(a)  Original  image 

(b)  High-contrast  segmentation 

(c)  Low-contrast  segmentation 

Figure  3:  High-  and  low-contrast  segmentations  of 
neighborhood 


(a)  Enhanced  image 

(b)  Components 

(c)  Property  and  adjacency  table 

Figure 4: Initial segmentation and table of components:
Properties  and  adjacencies  for  the  house  and  yard  image 


(a)  Seed 


(b)  Cycle  1 


(c)  First-found  instance 


Figure  5:  FF  search  for  first-found  house  instance 


(e)  First-large  instance 

Figure  6:  FL  search  for  first-large  house  instance 


(a)  Original  image 

(b)  House  instances  without  model  adjustment 

(c)  House  instances  with  model  adjustment 

Figure  7:  Reinterpretation  with  adjusted  model 
parameters 


(a)  Enhanced  image 

(b)  First  found  house  instance 

(c)  House  instance  after  optimization 

(d)  Original  image 

(e)  First-found  house  instances 

(f)  House  instances  after  optimization 

Figure  8:  Optimization  of  house  instances 




(a)  Enhanced  image 

(b)  House  instances 

(c)  Shadows  as  supporting  evidence 

Figure  9:  Search  for  supporting  evidence 


(a)  Original  image 

(b)  House  instances 

(c)  Recovered  parts  of  houses 

Figure  10:  Search  for  missing  objects 


(a)  Original  image 

(b)  Components  for  threshold  12 

(c)  House  instances 


(d)  Components  for  threshold  6 

(e)  Road  complex  instance 

Figure  11:  High-  and  low-contrast  segmentations  for 
houses  and  roads 




(a)  Original  image 

(b)  High-contrast  segmentation 

(c)  House  instances 

(d)  Low-contrast  segmentation 

(e)  Road  instances 

Figure  12:  High-  and  low-contrast  segmentations  for 
houses  and  roads 


(a)  Original  image 

(b)  Initial  segmentation 

(c)  Unoptimized  house  instances 

(d)  Optimized  house  instances 

Figure  13:  FF  house  instances  before  and  after  optimi¬ 
zation 


(a)  Original  image 

(b)  Initial  segmentation 

(c)  Unoptimized  house  instances 

(d)  Optimized  house  instances 

Figure  14:  FF  house  instances  before  and  after  optimi¬ 
zation 


Initial  Hypothesis  Formation  in  Image  Understanding 
Using  an  Automatically  Generated  Knowledge  Base 


Nancy Bonar Lehrer  George Reynolds

Joey  Griffith 

University  of  Massachusetts  at  Amherst 
Department of Computer and Information Science
Amherst, Massachusetts 01003


Abstract 

This paper presents a method for initial object hypothesis
formation in image understanding where the knowledge base is
automatically constructed given a set of training instances. The
hypotheses formed by this system are intended to provide an
initial focus-of-attention set of objects for a knowledge-directed,
opportunistic image understanding system whose intended goal
is the interpretation of outdoor natural scenes. Presented is an
automated method for defining world knowledge based on the
frequency distributions of a set of training objects and feature
measurements. This method takes into consideration the
imprecision (inaccurate feature measurements) and incompleteness
(possibly too few samples) of the statistical information available
from the training set. A computationally efficient approach to
the Dempster-Shafer theory of evidence is used for the
representation and combination of evidence from disparate sources. We
chose the Dempster-Shafer theory in order to take advantage of
its rich representation of belief, disbelief, uncertainty and
conflict. A brief intuitive discussion of the Dempster-Shafer theory
of evidence is contained in Appendix A.

1.  The  Interpretation  Problem 

An important task in image understanding is to develop
a mapping from low-level image events (such as regions,
lines or surfaces) to higher level semantic abstractions
(such as road, grass and foliage). Achieving this
task requires developing a knowledge base which defines
these semantic abstractions in terms of the low-level image
events and using constructive techniques to match or correlate
primitive features of the low-level events against the
knowledge base for each semantic abstraction. An inference
technique is required to compare, contrast and combine
match scores to create a consistent interpretation. Some
schemes for combining information from various sources
in initial hypothesis formation include additive “voting”
methods [Hans86, Belk86], Dempster-Shafer pooling of
evidence [Low82, Wes84, Rey86b], Bayesian methods [Duda73],
constraint propagation [Kitc84, Wal72], as well as many ad
hoc but heuristically adequate MYCIN-type systems [Shor76].
In any scheme three questions must be answered: What is
[...] What method will be employed for combining and
propagating evidence?

This work has been supported by the following grants: Air Force
Office of Scientific Research 86-0021, the Defense Mapping Agency
800-86-C-0012, and the National Science Foundation DCR-8318776.

In this paper we present a method for initial object
hypothesis formation in image understanding that has evolved
from earlier work in the VISIONS system environment
documented in [Hans87]. These hypotheses can then be used
as a focus-of-attention set by a knowledge-directed
interpretation system and expanded into a more complete
interpretation. An automated method is presented for defining
a knowledge base based on the frequency distributions of
a set of training objects and feature measurements. The
system takes an image that is segmented into closed-boundary
regions and “matches” each region against the stored
knowledge base to generate a set of initial hypotheses for
a given region. This method can also be used to classify
lines, surfaces or any other image abstraction or combination
of image abstractions. In our formalism, evidence is
represented by a plausibility function and combined using a
computationally efficient approach to the theory of evidence
pioneered by Glenn Shafer, referred to as the Dempster-Shafer
theory of evidence [Demp67, Shaf76].

At the heart of the Dempster-Shafer formalism is a rich
representation of evidence, a belief function or mass function,
and a method for combining evidence, Dempster’s
Rule. The formalism is often criticized for the computational
cost associated with Dempster’s Rule; in addition,
the Dempster-Shafer theory does not address the issue of
acquiring mass functions. The system presented here addresses
both these problems. It is able to use the rich
semantics implied by the evidential representation of the
Dempster-Shafer formalism without the computational cost
associated with Dempster’s Rule, and it defines an automatic
method for generating a knowledge base. A brief
intuitive discussion of the fundamental principles of the
Dempster-Shafer formalism is presented in Appendix A.

2.  The  VISIONS  Experiments 

In general we can think of intermediate-level image
abstractions as tokens one or two steps removed from the raw
image data represented as pixels. Regions can be thought
of as area-filling abstractions connecting pixels with some
homogeneity constraint [Kohl8^]. Straight lines may connect
and group pixels with the same gradient direction
[Burn86, Weis86]. Each intermediate-level token can be
associated with a feature vector that measures primitive
features. If pixels are grouped into closed area-filling regions,
measures can be defined which statistically describe a
region's color, intensity or texture. Depending upon the line
algorithm used, lines also have a variety of primitive
features such as length, position in the image (rho), angle or
orientation (theta), and measures of the contrast relating
the values of pixels on either side of the line.

An approach taken in VISIONS [Hans87] is to use these
primitive features to create initial hypothesis labels, which
associate with each region a semantic label to bootstrap
the high-level interpretation process. Although regions are
being labeled in this approach, line data can be incorporated
into the interpretation process by the use of relationships
between lines and regions [Belk86].

2.1  Approaches  to  Representing  and  Com¬ 
bining  Evidence 

Existing approaches to the generation of initial hypotheses
have used interactive and heuristic approaches to the
knowledge engineering problem. One in particular, the
rule-based object hypothesis system of Hanson and Riseman
[Hans86], defines constraints on the features of line
and region image abstractions; it uses frequency distributions
to guide the formation of heuristically defined, piecewise
linear ranking functions called rules. In that system,
rules provide a vote for a specific object, and a set of “simple”
rules is combined via a weighted average to form a
“complex” rule, which can then be similarly combined. The
output of a complex rule represents a weighted combination
of evidence from various features and is used to rank-order
image regions (or collections of regions and lines) according
to how well they match a prototype object.

The contribution of the system presented here is twofold:
(a) a formal and theoretical foundation for the combination
and representation of evidence, and (b) the automatic
construction of the knowledge base. It defines the
knowledge base automatically using statistical information
obtained from a set of training object instances and uses a
computationally efficient approach to the Dempster-Shafer
theory of evidence for the representation and combination
of evidence from disparate sources. The knowledge base is
represented in terms of plausibility distributions; their
construction takes into consideration the imprecision (i.e.,
inaccurate feature measurements) and incompleteness (i.e., too
few samples) of the statistical information available from
the training set. We show that the automatically generated
plausibility distributions characterize the range of feature
values associated with each object in the training set without
biasing the interpretation towards objects more likely to
appear at random in the training set. The term plausibility
is used to indicate the equivalence between our formalism
and the semantics and functionality of the term plausibility
in the Dempster-Shafer formalism.

2.1.1  The  Rule  System 

The task of the rule-based object hypothesis system
of Hanson and Riseman [Hans86] is to provide a
set of candidate object hypotheses to a knowledge-directed
schema network. Based on these initial hypotheses, an
island-driving focus-of-attention strategy is employed to
initiate further processing. Rules are formed by heuristically
assigning a vote for a specific object to ranges of feature
values. Sets of “simple” rules are combined via a weighted
average to form “complex” rules. Sets of complex rules are then
combined in the same way to form the initial hypotheses.

A rule is represented as six threshold values, $\theta_1, \ldots, \theta_6$;
these thresholds define a piecewise linear mapping function
from feature space to object space. The intervals $(-\infty, \theta_1]$
and $(\theta_6, \infty)$ represent a veto range, $[\theta_3, \theta_4]$ the range where
the object label associated with the prototype feature vector
receives a maximum vote of 1, the intervals $[\theta_1, \theta_2]$ and
$[\theta_5, \theta_6]$ a noncommittal vote of zero, and finally $[\theta_2, \theta_3]$ is
linearly ramped from 0 to 1 whereas $[\theta_4, \theta_5]$ is ramped from
1 to 0. Figure 2.1 shows the construction of a simple rule.
The weights given to the combination rule are heuristically
defined on a scale from 1 to 5.


Figure  2.1:  Structure  of  a  Simple  Rule 

Structure of a simple rule for mapping an image feature
measurement $f$ into a score for a label hypothesis on the basis
of a prototype feature value. The object-specific mapping is
parameterized by seven values, $f_p, \theta_1, \ldots, \theta_6$.
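
A simple rule of this kind, and the weighted average that forms a complex rule, might be sketched as below. The sketch assumes the conventional shape for the six thresholds: veto outside $(\theta_1, \theta_6]$, a zero vote on the two outer intervals, linear ramps on $[\theta_2, \theta_3]$ and $[\theta_4, \theta_5]$, and a full vote of 1 on $[\theta_3, \theta_4]$. The function names and the convention that a single veto vetoes the whole complex rule are assumptions made for illustration, not details taken from the rule system itself.

```python
import math

VETO = -math.inf  # a vote of minus infinity rules the hypothesis out


def simple_rule(f, thetas):
    """Map a feature value f to a vote using six increasing thresholds."""
    t1, t2, t3, t4, t5, t6 = thetas
    if f <= t1 or f > t6:
        return VETO                      # veto range
    if f <= t2 or f >= t5:
        return 0.0                       # noncommittal vote
    if f < t3:
        return (f - t2) / (t3 - t2)      # ramp from 0 up to 1
    if f <= t4:
        return 1.0                       # maximum vote
    return (t5 - f) / (t5 - t4)          # ramp from 1 down to 0


def complex_rule(votes, weights):
    """Weighted average of votes (weights on an assumed 1-to-5 scale).

    Assumption: any single veto vetoes the combined rule.
    """
    if any(v == VETO for v in votes):
        return VETO
    return sum(w * v for v, w in zip(votes, weights)) / sum(weights)
```

Since the output of `complex_rule` has the same range as a simple vote, complex rules can be fed back into `complex_rule` to build deeper hierarchies, matching the repeated combination described in the text.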

Due to the potentially large number of rules needed for
each object in the initial hypothesis set, the rule system is
designed to simplify the definition of each rule. The user
choosing the threshold values $\theta_1$ through $\theta_6$ for a particular
object is equipped with an interactive environment for
constructing these rules and displaying their effect. In this
system each object has a different set of features which
contribute to the object's prototype feature vector. Given
an image token, the score for one object is never directly
compared against the score for another object; instead, for
each object, tokens are ranked by how well they score for
that object. Regions with the highest scores for a particular
object are used as exemplar regions in an island-driving
strategy applied in later stages of the interpretation.

The rule system was developed in reaction to the problems
of using Bayesian techniques for the classification of tokens
based strictly on statistical information available from
an inadequate training set. In particular, if the training set
is small, the feature distribution may only coarsely characterize
feature space; in addition, the a priori probability
of seeing a particular object at random in the training set
is a highly unreliable estimate of the probability of seeing
that object in the world, yet plays a powerful role in Bayesian
decision functions [Duda73, Wood78]. Section 4.1.1 discusses
in more detail the problems inherent in Bayesian methods
when the statistical samples contain inaccurate or imprecise
information; see also [Low82, Wes84].

2.1.2  The  Plausibility  Formalism 

The  approach  used  in  our  plausibility  formalism  differs 
in  many  aspects  from  the  rule  system.  Whereas  the  high 
level  goal  of  producing  object  hypotheses  for  a  knowledge 
directed  schema  network  is  common  to  both  systems,  our 
approach  is  not  thought  of  as  producing  a  ranking  of  re¬ 
gions  for  each  symbolic  object  (although  the  results  may 
ultimately  be  used  in  this  manner).  The  outcome  is  in¬ 
stead  viewed  as  associating  with  each  region  an  evidential 
model  of  the  current  state  of  interpretation. 

Our general approach resembles the principles behind a
least-commitment or constraint-propagation system. The
system uses a set of features to rule out possible object
hypotheses until only a small set of initial hypotheses
remains. Each piece of evidence results in an assignment of a
plausibility value to each hypothesis, yielding a plausibility
function defined on the set of hypotheses. Each plausibility
value is generated by comparing a feature value for an
image event against the knowledge base. For example, the
system may measure the average color of region 2; the value
returned is then compared with the knowledge base for each
of road, foliage, sky, ... to find a plausibility value for each
possible semantic abstraction. The plausibility value represents
the extent to which that hypothesis should not be
ruled out as a possible hypothesis. Each plausibility function
represents a mass function (a method for transforming
a plausibility function into a mass function is discussed in
Section 3). The plausibility functions derived from each
feature measurement are combined such that they represent
the plausibilities of each singleton hypothesis as if the
equivalent mass functions were combined using Dempster’s
Rule. The result is a combined plausibility function which
represents a consensus of opinions from disparate pieces of
evidence. Figure 2.2 shows the construction of a plausibility
function given a feature measure and a knowledge base.
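
The lookup step, from one feature measurement to a plausibility function over all hypotheses, can be sketched as follows. The representation is an assumption made for illustration: each object's plausibility distribution is stored as a sorted list of `(feature_value, plausibility)` breakpoints and evaluated by linear interpolation. The breakpoint values in the usage note are invented, not taken from any training set.

```python
from bisect import bisect_right


def interpolate(points, x):
    """Evaluate a piecewise-linear curve given as sorted (x, y) breakpoints.

    Values outside the breakpoint range take the nearest endpoint value.
    """
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, x)              # points[i-1].x <= x < points[i].x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def plausibility_function(knowledge_base, f):
    """Return {object: plausibility} for one feature measurement f.

    knowledge_base maps each object label to its (hypothetical)
    plausibility distribution for the feature being measured.
    """
    return {obj: interpolate(dist, f) for obj, dist in knowledge_base.items()}
```

For example, with a hypothetical knowledge base `{"road": [(0.0, 0.0), (40.0, 1.0), (80.0, 1.0), (120.0, 0.0)], "sky": [(150.0, 0.0), (200.0, 1.0), (255.0, 1.0)]}`, a measurement of 90 leaves road fairly plausible while ruling sky out entirely, which is the "ruling out" role described above.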

It  is  important  to  understand  that  the  plausibility  func¬ 
tions  returned  by  each  knowledge  source  are  not  statements 
about  the  probability  of  a  hypothesis  in  a  Bayesian  view. 
The role of a plausibility value is to rule out unlikely states
of nature. The minimum semantic requirement for a plausibility
value is that it represent the extent to which a state
of nature should not be ruled out given an event.

2.1.3  Differences  in  Evidential  Representation 

In the rule system, a vote can have one of three different
effects on an object hypothesis. A vote between
zero and one supplies supporting evidence, a vote of
$-\infty$ vetoes an object hypothesis altogether, and a vote of
zero is noncommittal. The final score represents supporting
evidence for a particular object. On the surface, the plausibility
formalism only allows ruling out hypotheses, but by
transforming a plausibility function into a mass function
after combination we are also able to represent supporting
evidence as well as the possibility that an image abstraction
may not be any of the objects represented by the set of
possible hypotheses, i.e., conflicting evidence.


Figure  2.2:  Constructing  Plausibility  Functions 
The three shaded curves represent the knowledge base for
$Obj_1$, $Obj_2$, and $Obj_3$ in the form of plausibility distributions
for some feature. Given a feature value $f$, the plausibility
value $Pl(Obj_i \mid f)$ can be determined from the plausibility
distribution as shown by the shaded circles. A plausibility
function is the set of plausibility values obtained for a given
feature value.

The greatest advantage of the Dempster-Shafer approach
is its representation of several different types of evidence:
supporting evidence, plausible evidence, and conflicting
evidence. Its greatest disadvantage is the exponential nature of
Dempster’s Rule and the explicit representation of the powerset
of all object hypotheses. Our system provides a technique
that allows the representational power of the Dempster-Shafer
approach to be combined with the computational
efficiency of the rule system.

3.  A  Computationally  Efficient  Approach 
to  Dempster-Shafer 

A major criticism of the Dempster-Shafer formalism has
been the combinatorial problems related to use of the powerset
of all possible hypotheses in Dempster’s Rule. As
the frame of discernment becomes large, the computation
becomes unmanageable. To overcome this combinatorial
problem, what is needed is a way to represent a mass function
using only the elements of the frame of discernment and a
combination rule that is equivalent to Dempster’s Rule for
this simplified representation. Reynolds et al. [Rey86b]
describe a method whereby knowledge sources need not return
a mass function, but rather a plausibility function where
each individual element of the frame of discernment, $\Theta$, is
assigned a plausibility value between zero and one. A mass
function representation for a given plausibility function can
be obtained using the formula:

$$ m(A) = \frac{\prod_{a \in A} pl(a \mid f) \, \prod_{a \notin A} \bigl(1 - pl(a \mid f)\bigr)}{1 - k}, \qquad (3.1) $$

where pl(a | f) is the plausibility of seeing object a given
a feature value f (see Figure 2.2). The conflict value of a
plausibility function is defined as

$$ k = \prod_{a \in \Theta} \bigl(1 - pl(a \mid f)\bigr). \qquad (3.2) $$

Given two plausibility functions pl_1(a | f_1) and pl_2(a | f_2),
a ∈ Θ, we define the combination pl_3(a | f_1 ∧ f_2) by the
rule

$$ pl_3(a \mid f_1 \wedge f_2) = \frac{pl_1(a \mid f_1) \, pl_2(a \mid f_2)}{1 - k}, \quad a \in \Theta, \qquad (3.3) $$

where k is the conflict value defined above for pl_3(a | f_1 ∧ f_2).
Analogously we define the combination of an arbitrary
number of plausibility functions.

It is shown in [Rey86b] that this combination rule produces
the plausibilities of the singletons, as defined in
the Dempster-Shafer theory, when the mass functions generated
by formula 3.1 are combined using Dempster’s Rule.
The terms plausibility values and plausibility functions were
chosen to point out this equivalence. A complete discussion
and proof of the equivalence of this simple representation
and multiplicative rule with a class of mass functions and
Dempster’s Rule can be found in Reynolds et al. [Rey86b].
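The equivalence claimed above can be checked numerically on a small frame. The following is a minimal sketch, not the authors' implementation: the object names and plausibility values are made up, the function names are hypothetical, and it assumes that the k in formula 3.3 is computed with formula 3.2 from the unnormalized product of the two plausibility functions. The `dempster` function enumerates the powerset explicitly and is included only for comparison.

```python
from itertools import combinations
from math import prod


def conflict(pl):
    """Conflict value k of a plausibility function (formula 3.2)."""
    return prod(1.0 - v for v in pl.values())


def mass_function(pl):
    """Mass function over the non-empty subsets of the frame (formula 3.1)."""
    frame = sorted(pl)
    k = conflict(pl)
    masses = {}
    for size in range(1, len(frame) + 1):
        for subset in combinations(frame, size):
            inside = prod(pl[a] for a in subset)
            outside = prod(1.0 - pl[a] for a in frame if a not in subset)
            masses[frozenset(subset)] = inside * outside / (1.0 - k)
    return masses


def combine(pl1, pl2):
    """Multiplicative combination rule (formula 3.3)."""
    product = {a: pl1[a] * pl2[a] for a in pl1}
    k = conflict(product)  # assumption: k comes from the unnormalized product
    return {a: v / (1.0 - k) for a, v in product.items()}


def dempster(m1, m2):
    """Dempster's Rule on explicit mass functions (exponential in frame size)."""
    joint = {}
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            joint[s1 & s2] = joint.get(s1 & s2, 0.0) + v1 * v2
    empty = joint.pop(frozenset(), 0.0)  # mass that fell on the empty set
    return {s: v / (1.0 - empty) for s, v in joint.items()}


def singleton_plausibility(masses, a):
    """Plausibility of {a}: total mass of every subset that contains a."""
    return sum(v for s, v in masses.items() if a in s)


# Made-up plausibility functions from two hypothetical knowledge sources.
pl1 = {"grass": 0.9, "road": 0.2, "sky": 0.1}
pl2 = {"grass": 0.7, "road": 0.6, "sky": 0.3}

fast = combine(pl1, pl2)
slow = dempster(mass_function(pl1), mass_function(pl2))
for a in fast:
    print(a, round(fast[a], 6), round(singleton_plausibility(slow, a), 6))
```

The two printed columns agree for every object, while `combine` touches only the elements of the frame and `dempster` visits pairs of subsets.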

4.  Defining  the  Knowledge  Base 

In any image understanding scheme one of the most
ill-defined tasks is that of defining the knowledge base. To
overcome the problems of an inaccurate training set, the
rule system creates its knowledge base by hand using the
histogram of a set of training instances for heuristic guidance.
On the other hand, Bayesian techniques address the
problem of efficient definition of object rules for a given
feature in terms of their conditional probability distribution,
but these techniques lack the ability to handle inaccurate
or insufficient sample sets. This section discusses
a method for building a knowledge base, called plausibility
distributions, determined directly by the statistics of a
training set of image primitives (in this sense similar to
Bayesian techniques), but which can also deal with certain
kinds of uncertainty inherent in the training set. In particular
we address the situation where the training set is
sufficient to show where an object lies in feature space, but
the a priori estimate of seeing that object in the world, as
estimated by the number of samples in the training set,
is inadequate with regard to identifying objects in feature
space.

4.1  Criteria  for  Plausibility  Distributions 

Hanson and Riseman [Hans86] describe a method for
heuristically assigning a rule score
based on the feature distribution for a particular object
in relation to the feature distribution for the entire image (or
set of images). Our intent is to use statistical information
to automatically construct a knowledge base of plausibility
distributions given information about the
frequency distribution of a set of training instances.

The  need  is  to  find  some  function  which  characterizes 
the  area  in  feature  space  in  which  an  object  lies  and  also 
takes  into  account  the  relationship  between  the  object  and 
total  world  sample.  If  we  pick  a  set  of  pixels  to  represent 
some  object,  and  a  set  of  features  over  which  to  charac¬ 
terize  each  object,  then  we  have  the  following  statistical 
information: the frequency of seeing a particular feature
value, h(f), and the frequency of seeing a particular feature
value and a particular object, h(f ∧ o). (See Figure 4.1.)


Figure  4.1:  Frequency  distributions 
The  larger  curve  represents  the  frequency  of  seeing  a  partic¬ 
ular  feature  value,  denoted  by  h(f).  This  is  the  distribution 
of  the  feature  over  the  entire  population  of  samples.  The 
smaller,  shaded,  curve  represents  the  frequency  of  seeing  a 
particular feature value and a particular object, h(f ∧ o), the
distribution  of  the  feature  over  only  those  samples  labeled 
object  o. 

Some  considerations  in  evolving  a  system  for  producing 
automatic  plausibility  distributions  include: 

Statistical  Accuracy:  What  statistical  information  is  avail¬ 
able  from  a  training  set  of  object  instances,  what  as¬ 
sumptions  can  be  made  about  the  accuracy  of  these 
statistics,  and  how  useful  is  this  information  for  dis¬ 
crimination  and  recognition? 


Semantics of a Plausibility Value: What are the specific
semantics of a plausibility value with respect to
our plausibility formalism?

Size of the Training Set: Should the final plausibility
distribution be independent of the number of positive
training instances for a particular object in the world?
For example, if 30% of the training instances are object_1
and 20% are object_2, should the relative plausibility
of finding object_1 over object_2 be dependent upon the
probability of picking object_1 out of the training set
at random (this is the classic discussion about the
validity of a priori statistics)? It may or may not be
reasonable to assume a priori that certain objects are
more likely to be seen than others.

We  briefly  discuss  each  of  these  considerations  in  the 
next  three  sections. 

4.1.1  Statistical  Accuracy 

We must first address the question of the accuracy of
statistics collected over a set of training instances. Any
matching or inference technique is only as good as the
representation of the world knowledge as defined by a training
set, and the quality of the information returned by the
processes which measure the features to be compared with
the world knowledge. In complex domains, both areas are
subject to uncertain, imprecise and occasionally inaccurate
information [Low82]. One concern with probabilistic methods
is the amount of information needed to assure quality
statistics. To accurately compute the conditional probability
of seeing a particular state of nature ω given a feature
measure, p(ω | f), requires the a priori knowledge of seeing
ω given no other information, p(ω), the probability of
perceiving a particular feature measure, p(f), and the
probability of seeing a particular feature measure given a state
of nature, p(f | ω). At best, this information is difficult
to obtain or reliably estimate. Sources of uncertainty can
arise through inaccurate feature measures, incomplete data
sets, aliasing resulting from digitization and region
segmentation, and inaccurate region segmentation.
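For reference, the three quantities named above are tied together by Bayes' rule. This is the standard identity that the paragraph relies on without writing out; the symbols follow the text, and nothing in it is specific to this system:

```latex
p(\omega \mid f) = \frac{p(f \mid \omega)\, p(\omega)}{p(f)}
```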

4.1.2  Semantics  of  a  Plausibility  Value 

Secondly, we look at the desired semantics of a plausibility
value. The objective at this stage of interpretation
cannot be emphasized enough. When the interpretation
begins, each token is possibly any object contained in the
frame of discernment or the unknown event. Moreover, the
features used to characterize object hypotheses at this level
are extremely primitive. Therefore the goal is not object
classification, but rather a pruning of possible hypotheses.
With this in mind, the minimum semantic requirement of
a plausibility value is that it represent how much an object
should not be ruled out given a particular feature measure.


4.1.3  Size  of  the  Training  Set 

Finally, some understanding must be reached about the
size of the sample set for each object in the frame of
discernment. Using Bayesian techniques, the conditional
probability p(ω | f) represents both the probability of seeing object
ω, p(ω), and knowledge about the probability of seeing
feature measure f given ω, p(f | ω).

Consider the two-object case, with objects ω_1 and ω_2. Given
equal probability for p(f | ω_1) and p(f | ω_2), the
decision process is based solely on the values for p(ω_1)
and p(ω_2). In our paradigm of using hand-labeled sample
regions as the training set, these two a priori
probabilities are based solely on the number of samples for each
object in the training set. This is not an adequate estimate.
A compromise has been suggested in which the a priori
probability p(ω) is assumed equal for all objects and thus can be
removed from the calculation, leaving the likelihood ratio

$$ \frac{p(f \mid \omega)}{p(f)}. $$

A ratio greater than 1 indicates that the feature measure
f is more likely to occur for this particular object than for
some random object. A ratio equal to 1 indicates no
information, and a ratio less than 1 states that the feature value f is
more likely to occur by chance for any object than it is for
a particular object ω_1. This ratio is the theoretical
foundation underlying the heuristic approach of Hanson and
Riseman [Hans86] and has also been employed by Woods
in speech understanding [Wood78]. If the a priori statistics
are assumed to be inaccurate, this is still not a reasonable
solution because the a priori statistic p(ω) is still present
in the term p(f | ω).
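The three cases of the likelihood ratio (greater than, equal to, and less than 1) can be seen on a toy histogram. This is a minimal sketch with made-up counts; the function name `likelihood_ratios` and the five-bin feature space are assumptions for illustration only.

```python
def likelihood_ratios(h_object, h_total):
    """Estimate p(f | ω) / p(f) for every feature bin from raw counts."""
    n_object = sum(h_object)  # number of samples labeled ω
    n_total = sum(h_total)    # number of samples of every label
    ratios = []
    for ho, ht in zip(h_object, h_total):
        p_f_given_obj = ho / n_object  # estimate of p(f | ω)
        p_f = ht / n_total             # estimate of p(f)
        ratios.append(p_f_given_obj / p_f)
    return ratios


# Hypothetical counts per feature bin (not taken from the experiments).
h_object = [0, 2, 6, 2, 0]        # samples labeled ω
h_total = [10, 10, 10, 10, 10]    # samples of all labels

for f, ratio in enumerate(likelihood_ratios(h_object, h_total)):
    if abs(ratio - 1.0) < 1e-9:
        verdict = "no information"
    elif ratio > 1.0:
        verdict = "f favors this object"
    else:
        verdict = "f is more typical of other objects"
    print(f, round(ratio, 3), verdict)
```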

If it is reasonable to assume that certain objects are more
likely to appear than other objects, then some estimate of
the a priori probability may be devised. On the other hand,
if you assume that the a priori probabilities are ill defined at
best, then it is not good enough to simply remove that term
from the computation of the conditional statistic; rather,
some effort must be made to factor out the size of the sample
over which the probabilities are calculated.

In the domain considered here the number of samples
for a given object, o, is not an adequate estimate of the a
priori probability of seeing that object in the world. Indeed
this a priori probability may be impossible to determine.
However, the number of samples does influence h(f) and
h(f ∧ o). What is needed is to find the range of feature
values associated with a given object and to factor out any
effects related to the probability of picking an object at
random out of the training set.

In summary, early work on the generation of feature
rules by Hanson and Riseman suggests that the rule for an
object should be influenced not only by its conditional
feature distribution, but also by its relationship to the feature
distribution of all the objects in the entire training set.
In addition, we set as our goal that the final plausibility
distributions should be characterizations of feature ranges and
should not include information about a priori probabilities.


4.2  Automatic  Generation  of  Plausibility 
Distributions 


For a given feature value consider the ratio of the height
of a frequency distribution for a given object, h(f ∧ o_i), to
the height of the frequency distribution for all training
objects, h(f) (see Figure 4.1). This is by definition an estimate
of the conditional probability 1

$$ \hat{p}(o_i \mid f) = \frac{h(f \wedge o_i)}{h(f)}. \qquad (4.1) $$

Similarly defined are the estimates p̂(o_i), the relative
frequency of seeing object_i in the knowledge base, and p̂(f),
the relative frequency of seeing a particular feature value
f:

$$ \hat{p}(o_i) = \frac{\#\,\text{of object}_i}{\sum_{j=1}^{n} \#\,\text{of object}_j} = \frac{\sum_f h(f \wedge o_i)}{\sum_f h(f)}, \qquad (4.2) $$

$$ \hat{p}(f) = \frac{h(f)}{\sum_f h(f)}. \qquad (4.3) $$

As mentioned earlier, one desirable characteristic of a
plausibility distribution is that it be relatively invariant
with respect to the size of the training set used to model
the feature distribution. That is, two distributions differing
only in the size of the sample set over which they are
defined should have the same plausibility distribution. We
can show that one way of accomplishing this behavior is
to use the estimate of the a priori probability, p̂(o_i), as a
decision threshold on the conditional probability p̂(o_i | f).
If the value of p̂(o_i | f) is at least as large as the estimate of
seeing o_i, p̂(o_i), assume a plausibility value of 1; otherwise
normalize by p̂(o_i). We now have the following definition for
a plausibility function.

Definition: For each object o_i and each feature f we
define a plausibility value for object o_i given feature f as
follows: 2

$$ Pl(o_i \mid f) = \begin{cases} 1 & \text{if } \hat{p}(o_i \mid f) \ge \hat{p}(o_i), \\[4pt] \dfrac{\hat{p}(f \wedge o_i)}{\hat{p}(f)\,\hat{p}(o_i)} & \text{otherwise.} \end{cases} \qquad (4.4) $$

We can now examine the behavior of a plausibility value
with respect to the size of the training set that makes up
our statistical samples. To do this we must look at the
behavior of

$$ \frac{\hat{p}(f \wedge o_i)}{\hat{p}(f)\,\hat{p}(o_i)} \qquad (4.5) $$

with respect to two objects with the same distribution but
with different sample sizes. Given two objects, o_1 and
o_2, we can show that Pl(o_1 | f) = Pl(o_2 | f) regardless
of the difference in the size of p(o_1) and p(o_2) as long as
the distributions for the two objects occupy the same area in
feature space.

Theorem: Let p(f ∧ o_2) = α p(f ∧ o_1) and p(o_2) = α p(o_1),
where α is some scalar. Then

$$ \frac{p(f \wedge o_2)}{p(f)\,p(o_2)} = \frac{\alpha\, p(f \wedge o_1)}{p(f)\,\alpha\, p(o_1)} = \frac{p(f \wedge o_1)}{p(f)\,p(o_1)}. $$

We show in Figure 4.2 that this function indeed characterizes
the proportion of feature space in which the object's
feature distribution lies and minimizes the effect of the sample
size over which the statistics are taken, as stated by the
above theorem.

1 We will use the notation ^ (a hat, as in p̂) to indicate the use of
estimates of probabilities based on discrete samples.

2 A full discussion of this definition and proof of related theorems can
be found in [Rey86b].
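The construction in Section 4.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name and the toy histograms are hypothetical, h(f) is taken to be the sum of the listed object histograms (that is, every training sample carries one of the listed labels), and a feature bin with no samples at all is given plausibility 0.

```python
def plausibility_distributions(histograms):
    """Build Pl(o | f) for every object from per-object histograms h(f ∧ o).

    `histograms` maps an object label to a list of counts, one per feature bin.
    """
    labels = list(histograms)
    n_bins = len(histograms[labels[0]])
    # h(f): counts over all training samples (assumed to be the listed objects)
    h_f = [sum(histograms[o][b] for o in labels) for b in range(n_bins)]
    total = sum(h_f)
    result = {}
    for o in labels:
        p_o = sum(histograms[o]) / total              # formula 4.2
        pl = []
        for b in range(n_bins):
            if h_f[b] == 0:
                pl.append(0.0)                        # assumption: unseen value
                continue
            p_o_given_f = histograms[o][b] / h_f[b]   # formula 4.1
            pl.append(min(1.0, p_o_given_f / p_o))    # formula 4.4
        result[o] = pl
    return result


# Objects "a" and "b" have the same shape, but "b" has three times the samples.
toy = {
    "a": [1, 4, 1, 0, 0],
    "b": [3, 12, 3, 0, 0],
    "c": [0, 2, 6, 6, 2],
}
pl = plausibility_distributions(toy)
for label, values in pl.items():
    print(label, [round(v, 3) for v in values])
```

Objects "a" and "b" receive the same plausibility distribution even though their sample counts differ by a factor of three, which is the invariance stated by the theorem.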


5.  Using  Plausibility  Functions  for  Initial 
Hypothesis 

In this experiment we started with six images of New
England road scenes digitized to a resolution of 256 x 256
pixels. Each image was then segmented using a knowledge
based segmentation technique [Hans87]. Each region was
hand labeled with one of the following labels {ambiguous,
barn, dirt, foliage, grass, gravel, house, phonepole, pole,
post, railing, road, roadline, sign, sky, sky-tree, tree-trunk,
wire}, but only objects with a significant number of
occurrences in the training set were used in the frame of
discernment. The context for this experiment is defined by the
following question, frame of discernment and set of feature
spaces:

Question: “What are the plausible semantic labels for this
region?”

Frame Of Discernment: {foliage, grass, gravel, road,
roadline, sky, sky-tree, trunk}. 3

Feature Spaces: Four feature categories were used:
Intensity, Color, Location and Texture.

• Intensity Features: Y color transform, Intensity
color transform, Raw red, Raw green, Raw blue.

• Color Features: Percentage of red, Percentage
of green, Percentage of blue, Excess red, Excess
green, Excess blue, Hue, Saturation, Q color
transform, I color transform.

• Texture Features: Horizontal edge measure,
Vertical edge measure, Lower diagonal edge measure,
Upper diagonal edge measure.

• Location Features: Row position of centroid.

3 The unknown object is not explicit but rather implicitly
represented by the conflict value k.




Figure  4.2:  Plausibility  distributions  of  synthetic  world  knowledge. 
This  graph  shows  the  plausibility  distributions  for  synthetic  world  knowledge  as  defined  by  a 
set of four Gaussian distributions. On the left is displayed h(f), the sum of the Gaussians, with
h(f ∧ o) shown separately from top to bottom. On the right, the four plausibility distributions
are displayed. Note that the top two distributions have the same mean and standard deviation,
and their plausibility distributions are identical. The dotted horizontal lines pass through p(o).


No information about object size or shape was used in
this set of experiments. The objective was to use a simple
set of feature measures to reduce the set of possible
object hypotheses for any particular region. In particular,
the feature measures used here contain little or no special
knowledge which relates to a specific context. The
approach is not strictly classification, but rather an attempt to
provide some initial evidence for a hypothesis. The initial
hypothesis can then provide information about a more specific
context to be used as interpretation continues; this
more specific context brings with it more specific, and perhaps more
expensive, feature measures. Under current development is
a system for extending and verifying an initial hypothesis
using a high-level knowledge based system implemented in
a blackboard architecture [Weym86, Drap86].


The  knowledge  base  for  each  feature  space  was  formed 
over  the  feature  distribution  of  the  hand  labeled  regions 
from  all  six  images.  The  feature  spaces  were  specifically 
designed  to  use  only  features  that  could  be  measured  over 
pixels.  For  each  feature,  a  feature  plane  is  defined  which 
encodes  a  feature  value  at  every  point  in  the  image.  The 
feature  planes  and  the  training  regions  are  then  used  to 
create  a  pixelwise  frequency  distribution  for  each  object  in 
the  frame  of  discernment.  An  alternative  is  to  define  the 
frequency  distributions  by  the  mean  feature  value  defined 
over  the  training  regions  and  to  weight  the  mean  value  by 
the  size  of  the  region.  In  the  latter  method,  a  high  degree  of 
smoothing is required to produce a robust frequency
distribution. In the pixelwise method only minimal smoothing
was used. During the interpretation phase, each knowledge
source returns an evidential model based on the mean
feature value of the region in question.


Figure 5.1: Plausibility distributions of Intensity.

These graphs show the feature histograms and resulting plausibility functions for the objects
foliage, grass, gravel, road, roadline, sky, sky-tree, and total. For each object, the lower curve
is the histogram of the actual feature values for the regions in the training set. The upper
curve is the resulting plausibility function. Each object histogram is scaled with respect to
the number of samples contained in the training set for that particular object. The histogram
for “total” in the lower right hand corner represents all regions in the training set including
those whose labels are not included in the set of possible initial hypotheses.

In addition, no plausibility was allowed to receive a
value of zero. Due to the multiplicative nature of our
combining function, a value of zero could cause a hypothesis
to be completely ruled out based on errorful information.
Instead, all plausibility values were normalized between 0.1
and 1. Figure 5.1 shows the plausibility functions generated
for the feature intensity, where intensity is defined as
(R + G + B)/3.
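The effect of keeping plausibility values away from zero can be sketched as follows. The text does not say how values were normalized between 0.1 and 1, so the linear mapping below is an assumption, and both function names are hypothetical.

```python
def intensity(r, g, b):
    """Intensity feature as defined in the text: (R + G + B) / 3."""
    return (r + g + b) / 3


def floor_plausibility(pl, floor=0.1):
    """Linearly map a plausibility in [0, 1] onto [floor, 1] (assumed mapping)."""
    return floor + (1.0 - floor) * pl


# Three made-up plausibility values for one object; the middle one is a
# spurious zero, for example from a badly placed segmentation boundary.
values = [0.9, 0.0, 0.8]

raw_product = 1.0
floored_product = 1.0
for v in values:
    raw_product *= v
    floored_product *= floor_plausibility(v)

print(raw_product)      # the single zero rules the object out completely
print(floored_product)  # small but non-zero, so later evidence still counts
```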


5.1  Understanding  Conflict 

In Section 3 we discussed a combination function for
plausibility functions which parallels Dempster’s Rule
applied to mass functions. This combination function uses
(1 - k) as a normalization constant. This section introduces
other possible uses of normalization and different
normalization constants when conflict is due to a situation other
than the disagreement of knowledge sources.

An approach taken in the rule system is to load the
system with redundant information so that no one knowledge
source contributes a significant amount of information to
the interpretation process. This is desirable if any of the
knowledge sources are suspected to be unpredictably
errorful. Unfortunately, in the approach presented here, as
the number of objects and feature spaces increases, so does
the conflict between plausibility functions in the
interpretation of a token. In particular the automatic plausibility
distributions may allow many objects to receive a small
plausibility value, ruling out completely no one object for
any one feature value. For many tokens, each object in
the frame of discernment will be ruled out to some
significant degree by at least one piece of evidence due to errorful
data (e.g. inaccurate feature measurement due to aliasing
or poor placement of a segmentation boundary). This
situation introduces conflict into the final interpretation which
is not due to the detection of an unknown image event.

Using the normalization of (1 - k) we are able to
explicitly represent an unknown event by the value of k in each
plausibility function. Intuitively, the amount to which two
knowledge sources do not agree, k, indicates the amount to
which the correct initial hypothesis is not contained in the
frame of discernment. Normalizing a combined plausibility
function by (1 - k) produces plausibility values identical
to the plausibilities over the singletons of the equivalent
mass function obtained using Dempster’s Rule and allows
an unknown event to occur and be represented as conflict.

It has been argued that normalization is used to
eliminate or hide a contradiction of aggregate evidence [Zad83].
It must be noted that with a multiplicative combination
function, the use of no normalization will create
monotonically decreasing plausibility values. This means that any
evidence supplying a plausibility value less than 1.0 can
only detract from a hypothesis, and the more evidence that
is accumulated, the more the system will become
susceptible to errorful data. The use of a normalization by (1 - k),
however, keeps the system from being a monotonically
decreasing system, preserves the semantics of the plausibility
values as the plausibilities of the singleton subsets of the
related mass function, and preserves the representation of
the unknown object as conflict.
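The difference between combining with and without normalization can be seen on repeated, mutually agreeing evidence. This is a minimal sketch with made-up values; it assumes that the combination of several plausibility functions is their product divided once by (1 - k), with k computed from that product, and the names used are hypothetical.

```python
from math import prod


def normalized(pl):
    """Divide a plausibility function by (1 - k), with k from the function itself."""
    k = prod(1.0 - v for v in pl.values())
    return {a: v / (1.0 - k) for a, v in pl.items()}


# Three knowledge sources that all favor "grass" (made-up values).
evidence = [{"grass": 0.9, "road": 0.2}] * 3

running = {"grass": 1.0, "road": 1.0}
raw_history, norm_history = [], []
for pl in evidence:
    running = {a: running[a] * pl[a] for a in running}
    raw_history.append(running["grass"])               # no normalization
    norm_history.append(normalized(running)["grass"])  # normalized by (1 - k)

print([round(v, 4) for v in raw_history])
print([round(v, 4) for v in norm_history])
```

Without normalization the value for "grass" falls with every new source even though all sources favor it; with normalization by (1 - k) it rises.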

5.2  Interpretation  Strategies  and  Future
Work

In these experiments, the Dempster-Shafer formalism is
not used as an end to reasoning, but rather as a
representation formalism for evidence. The interpretation strategy
to be employed after initial hypothesis generation dictates
some of the desired qualities for a decision over the set of
initial hypotheses. One approach might be to use the
initial hypotheses to indicate exemplar regions, regions which
most look like grass for example. These regions can then be
used to constrain the interpretation of neighboring regions.
On the other hand, the initial hypotheses can be viewed as
simply reducing the set of possible hypotheses.

Several different decision strategies were suggested for
creating the initial set of hypotheses for this experiment. The
simplest strategy is to normalize out all conflict contained in
a plausibility function (by dividing the plausibility function
by the maximum plausibility value) and use these numbers
for ranking objects. This strategy does not allow the
representation of the unknown object. Normalizing by (1 - k)
allows an explicit representation of the unknown event.
Another strategy suggested uses the idea of consistent
evidence. A simple strategy can be used to find the most
likely object. Next, all plausibility functions which rule out
to some significant degree the top ranking object, as
produced by the simple strategy, are removed from the
final combination of evidence.
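The two strategies described above can be sketched as follows. This is an illustration under stated assumptions rather than the strategy actually used: the text does not define "rule out to some significant degree", so the cutoff `threshold=0.3` is hypothetical, as are the function names and the plausibility values.

```python
from math import prod


def combine(functions):
    """Multiply plausibility functions and normalize the product by (1 - k)."""
    frame = functions[0].keys()
    product = {a: prod(pl[a] for pl in functions) for a in frame}
    k = prod(1.0 - v for v in product.values())
    return {a: v / (1.0 - k) for a, v in product.items()}


def rank_by_max(pl):
    """Simple strategy: divide by the maximum value, then rank the objects."""
    top = max(pl.values())
    scaled = {a: v / top for a, v in pl.items()}
    return sorted(scaled, key=scaled.get, reverse=True), scaled


def consistent_evidence(functions, threshold=0.3):
    """Drop the functions that largely rule out the top-ranked object."""
    best = rank_by_max(combine(functions))[0][0]
    kept = [pl for pl in functions if pl[best] >= threshold]  # assumed cutoff
    return best, combine(kept)


# Made-up evidence: the third source disagrees with the other two.
functions = [
    {"grass": 0.9, "road": 0.3},
    {"grass": 0.8, "road": 0.4},
    {"grass": 0.15, "road": 0.7},
]
best, refined = consistent_evidence(functions)
print(best)
print({a: round(v, 4) for a, v in combine(functions).items()})
print({a: round(v, 4) for a, v in refined.items()})
```

Note that dividing by the maximum always gives the top object a value of 1, which is why that strategy leaves no room for an unknown object.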

The decisive factor for determining the best decision
strategy at any step in the interpretation should be the
context and objectives of the interpretation system as a
whole. There may be only a few relevant strategies for
controlling conflict, or control may become a highly
variable, context-dependent process.

6.  Results  and  Figures 

The hypotheses displayed in this section are initial
hypotheses to be used by a knowledge-directed, opportunistic
image interpretation system. The initial hypotheses can
be verified and used to create a contextual environment in
which to hypothesize objects not defined by the frame of
discernment, hallucinate the merging or splitting of image
tokens, and in general aid in the development of a mapping
between image events and world events.

We chose the following two-step combination scheme
using the multiplicative combination function defined by
formula 3.3:

1. Combine all the plausibility functions for each of the
four feature spaces (i.e. intensity, color, texture and
location) and normalize each by (1 - k), with k defined
by formula 3.2 for each plausibility function.

2. Combine the four resulting plausibility functions and
again normalize by (1 - k), with k defined by the
combined plausibility function.

By combining the evidence for each category of features
first, we were able to let each group have equal weight in the
final combination. For instance, there are ten color features,
but only one location feature. By combining the color
features and renormalizing before combining with the location
feature, we were able to represent color and location equally
in the final interpretation. Once the evidence contained in
each plausibility function is combined, a mass function is
constructed as discussed in Section 3 to represent the final
evidential model.
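The two-step scheme can be sketched as a grouped combination. This is a minimal illustration, not the authors' code: the function names are hypothetical, the plausibility values are made up, three color functions stand in for the ten color features, and k is assumed to be computed from each unnormalized product.

```python
from math import prod


def combine(functions):
    """Multiply plausibility functions and normalize the product by (1 - k)."""
    frame = functions[0].keys()
    product = {a: prod(pl[a] for pl in functions) for a in frame}
    k = prod(1.0 - v for v in product.values())
    return {a: v / (1.0 - k) for a, v in product.items()}


def two_step(groups):
    """Step 1: combine within each feature category. Step 2: combine the results.

    `groups` maps a category name to its list of plausibility functions.
    """
    per_category = {name: combine(fns) for name, fns in groups.items()}
    return per_category, combine(list(per_category.values()))


# Made-up evidence for one region over a two-object frame.
groups = {
    "color": [{"grass": 0.9, "road": 0.5}] * 3,
    "location": [{"grass": 0.3, "road": 0.9}],
}
per_category, final = two_step(groups)
for name, pl in per_category.items():
    print(name, {a: round(v, 4) for a, v in pl.items()})
print("final", {a: round(v, 4) for a, v in final.items()})
```

Each category enters the second step as a single renormalized plausibility function, regardless of how many features it contains.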

Figures 6.2 through 6.6 each show four pieces of
information for a particular object hypothesis. The terms
support, plausibility, singleton objects, and mass value are
used in the sense of the Dempster-Shafer theory of evidence.
Brief definitions of these terms can be found in Appendix
A. For each piece of information, the intensity encodes a
numeric value with darker regions representing higher
numbers.




•  The top left quadrant displays all the regions which
would answer the question “What are all the regions
initially hypothesized as object X?”, where X
represents the object under consideration. An object is
considered an initial hypothesis if it is contained in
the highest ranking subset of the final mass function.

•  The  top  right  quadrant  shows  the  mass  value  assigned 
to  the  highest  ranked  subset  for  those  regions  dis¬ 
played  in  the  top  left  quadrant. 

•  The  bottom  left  shows,  for  every  region  in  the  image, 
the  support  value  for  the  set  containing  the  singleton 
object  under  consideration.  The  support  is  computed 
from  the  final  mass  function  and  can  be  thought  of 
as  the  lower  bound  on  the  system’s  belief  that  this 
region  is  this  object. 

•  Finally,  the  bottom  right  displays  the  plausibility  value 
for  the  set  containing  the  singleton  object  under  con¬ 
sideration.  Again,  the  plausibility  is  computed  from 
the  final  mass  function  and  can  be  thought  of  as  the 
upper  bound  on  the  system’s  belief  that  this  region  is 
this  object. 
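The quantities shown in the four quadrants can be computed from a mass function as sketched below. The definitions of support and plausibility are the standard Dempster-Shafer ones; the mass values are made up, the function names are hypothetical, and taking the "highest ranking subset" to be the subset with the largest mass is an assumption.

```python
def support(masses, hypothesis):
    """Support (lower bound): total mass of the subsets contained in `hypothesis`."""
    return sum(v for s, v in masses.items() if s <= hypothesis)


def plausibility(masses, hypothesis):
    """Plausibility (upper bound): total mass of the subsets intersecting `hypothesis`."""
    return sum(v for s, v in masses.items() if s & hypothesis)


# A made-up final mass function for one region.
masses = {
    frozenset({"road"}): 0.5,
    frozenset({"road", "gravel"}): 0.3,
    frozenset({"road", "gravel", "sky"}): 0.2,
}

best_subset = max(masses, key=masses.get)  # assumed meaning of "highest ranking"
for label in ("road", "gravel", "sky"):
    singleton = frozenset({label})
    print(
        label,
        label in best_subset,                     # top left quadrant
        round(support(masses, singleton), 3),     # bottom left quadrant
        round(plausibility(masses, singleton), 3) # bottom right quadrant
    )
print(sorted(best_subset), masses[best_subset])   # top right quadrant
```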

In Figure 6.7, the combined conflict obtained for each
region is displayed. A high conflict value (i.e. a dark region)
indicates a large amount of disagreement between separate
pieces of evidence. This is the value for k contained in
the normalization constant of the combination function (see
formula 3.2). The conflict measure can be used to pinpoint
areas which may need more verification. Also of interest
is the conflict obtained by each intermediate combination.
This is displayed from left to right, top to bottom as conflict
from color, intensity, texture and location respectively in
Figure 6.8.

7.  Conclusion 

We have presented here a method which automatically
generates a knowledge base for the formation of initial
object hypotheses using statistical information provided from
a set of training objects. The plausibility formalism uses a
computationally efficient approach to the Dempster-Shafer
formalism for representing and combining evidence. The
Dempster-Shafer formalism is used for its rich evidential
representation. Our system addresses the problems of
using heuristic methods for constructing a knowledge base
and combining evidence as well as the problems of a strict
Bayesian approach. We have presented a set of results in which our
plausibility theory is applied to a set of color outdoor road
scenes; these results suggest that the approach has significant
potential.

8.  Acknowledgements 

We would like to thank Edward Riseman and Allen Hanson
for  their  guidance  and  careful  reading  of  this  paper.  In 
addition  thanks  go  to  Deborah  Strahman  for  sharing  her 
breadth  of  knowledge  in  the  areas  of  probable,  possible, 
and  plausible  reasoning. 


Figure  6.1:  Intensity  Image 
The  intensity  image  digitized  to  a  resolution  of  256  by  256. 
The  intensity  was  computed  from  the  red,  green,  and  blue 
planes using the formula (R + G + B)/3.


Figure  6.2:  Initial  Hypotheses  for  Road 


Figure  6.3:  Initial  Hypotheses  for  Road  Line 




Figure  6.8:  Conflict  for  Color,  Inten¬ 
sity,  Texture,  and  Location 


A.  Dempster-Shafer  Tutorial 

A.1 Dempster-Shafer: An Intuitive Approach

The Dempster-Shafer theory of evidence provides a rich
set of semantics for representing and combining evidence
and can be viewed as a least commitment approach. Each
piece of evidence is intended to reduce a set of possible
hypotheses. The consensus of many pieces of evidence is then
used to focus on the smallest possible set of hypotheses.
Dempster’s combination rule is the heart of the system and
much literature can be found describing its focusing methods
[Shaf76,Lu84,Low82,Wes84,Rey86a,Rey86b].

A.2 Terms and Concepts

The  Dempster-Shafer  theory  of  evidence  can  be  broken 
up  into  several  concepts: 

Frame of Discernment: A framework which defines a
set of possible hypotheses, or possible answers to a
given question. The frame of discernment defines a
set of assertions: “This is an X”, where X is an element
of the frame of discernment. A frame of discernment
is defined to contain all possible answers to
the question to be answered. This assumption can be
fulfilled by the addition of some “unknown” object
which represents all answers not explicitly contained
in the frame of discernment. The set which contains
all elements of the frame of discernment is represented
by Θ.

A Mass Function: A structure for representing evidence
which maps a unit of belief over the powerset of the
frame of discernment. The mass value assigned to
a subset of the frame of discernment quantifies how
much belief is contained in the system that exactly
that subset of hypotheses is the best answer to the
question.

A Feature Space: Each feature space is fashioned to provide
information needed to answer some particular
aspect of the given question. A set of feature spaces
is designed to bring together many aspects of the
evidence needed for an inference.

A  Knowledge  Source:  Maps  a  feature  value  for  a  spec¬ 
ified  feature  space  into  a  representation  suitable  for 
combining  with  other  evidence  given  a  combination 
rule.  In  this  case,  a  knowledge  source  maps  a  feature 
value  into  a  mass  function  to  be  combined  with  other 
pieces  of  evidence  using  Dempster’s  Rule. 


Dempster’s Rule: Dempster’s Rule combines evidence as
represented by two mass functions to create a consensus
of opinion. Dempster’s Rule is both associative
and commutative, providing a means for several pieces
of information to be brought together as combined
evidence. Mathematically, Dempster’s Rule is the
orthogonal sum of two mass functions and is defined as
follows: for every C ≠ ∅,

$$m_1 \oplus m_2(C) = \frac{\sum_{A \cap B = C} m_1(A) \cdot m_2(B)}{1 - k}, \qquad \text{where } k = \sum_{A \cap B = \emptyset} m_1(A) \cdot m_2(B).$$

k is precisely a statement about how much disagreement
exists between the two pieces of evidence.


Support, Plausibility, Doubt, and Conflict: Metrics which
describe the degree to which some member or some
subset of the frame of discernment is the correct
answer to a given question.

In the Dempster-Shafer approach, the object space is defined
over the powerset of all possible objects. A hypothesis
then is not merely an element of the frame of discernment,
but rather a set of objects. Each knowledge source is
responsible for distributing a unit of belief across the subsets
of the frame of discernment. The resulting list of object
subsets and mass values is a mass function. Higher mass
values are associated with more likely hypotheses. Three
restrictions apply to mass functions: each subset receives a
mass value between zero and one, the empty set receives
zero mass, and the sum of the mass values over the object
space equals one. For a complete discussion of mass
functions see [Shafer].

In general, an inference problem consists of a question to
be answered, a frame of discernment containing all possible
answers to this question, and a set of feature spaces each
designed to provide some piece of information useful in
answering the question at hand. This collection is termed
a context.

A.3 Specific Attributes of the Dempster-Shafer Approach

As well as providing a combination rule for bringing together
disparate sources of evidence, the Dempster-Shafer
approach provides explicit information about supporting
evidence, plausible evidence, uncertainty of a decision,
conflict between knowledge sources, disbelief and no belief.
The following subsections describe these types of information.

A.3.1 Support, Doubt, and Plausibility

The mass values associated with each subset in the object
space are only a small portion of the information supplied
by a mass function. Of initial importance is the information
obtained by asking the following questions for a given
subset A: “What amount of evidence supports A as being
the correct answer?”, “What amount of evidence refutes A
as being the correct answer?” and “What amount of evidence
fails to refute A as being the correct answer?”. The
values obtained by asking these questions are referred
to as support, doubt and plausibility respectively.

$$Spt(A) = \sum_{S \subseteq A} m(S)$$

$$Dbt(A) = \sum_{S \cap A = \emptyset} m(S)$$

$$Pls(A) = \sum_{S \cap A \neq \emptyset} m(S) = 1 - Spt(\neg A)$$

Figure A.1 shows graphically the relationship between
the support, doubt and plausibility.

Figure A.1: Venn diagram of support,
plausibility and doubt.

Spt(A) = m(A) + m(C).

Pls(A) = m(A) + m(C) + m(D) + m(E).

Dbt(A) = m(B).
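
The three measures can be computed directly from a mass function. The following is a minimal Python sketch, assuming a mass function is stored as a dictionary from frozensets of object labels to mass values. The function names, the labels (`road`, `grass`, `sky`) and the mass values are illustrative assumptions and are not taken from the system described in this paper.

```python
from typing import Dict, FrozenSet

# Assumed representation: subset of the frame of discernment -> mass value.
Mass = Dict[FrozenSet[str], float]


def is_mass_function(m: Mass, tol: float = 1e-9) -> bool:
    """Check the three restrictions that apply to mass functions."""
    values_ok = all(0.0 <= v <= 1.0 for v in m.values())
    empty_ok = m.get(frozenset(), 0.0) == 0.0
    total_ok = abs(sum(m.values()) - 1.0) <= tol
    return values_ok and empty_ok and total_ok


def support(m: Mass, a: FrozenSet[str]) -> float:
    """Spt(A): mass committed to non-empty subsets of A."""
    return sum(v for s, v in m.items() if s and s <= a)


def doubt(m: Mass, a: FrozenSet[str]) -> float:
    """Dbt(A): mass committed to non-empty sets disjoint from A."""
    return sum(v for s, v in m.items() if s and not (s & a))


def plausibility(m: Mass, a: FrozenSet[str]) -> float:
    """Pls(A): mass on sets that intersect A (evidence failing to refute A)."""
    return sum(v for s, v in m.items() if s & a)


# Hypothetical example over the frame {road, grass, sky}.
THETA = frozenset({"road", "grass", "sky"})
m = {
    frozenset({"road"}): 0.4,
    frozenset({"road", "grass"}): 0.3,
    frozenset({"grass"}): 0.2,
    THETA: 0.1,
}
A = frozenset({"road"})
print(support(m, A), doubt(m, A), plausibility(m, A))
```

In this example the plausibility of A equals one minus its doubt, which is the relation Pls(A) = 1 - Spt(¬A) given above.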

A.3.2 Knowledge about Conflict

Implicit with each combined mass function is a statement
about how the knowledge sources involved in the decision
process agree with one another. This is known as the
conflict value and is precisely defined by Dempster’s Rule
as:

$$k = \sum_{A \cap B = \emptyset} m_1(A) \cdot m_2(B). \tag{A.3}$$

One assumption mentioned earlier is that all possible
objects are represented in the frame of discernment. This
may be accomplished by the addition of the unknown object
in the frame of discernment. If all knowledge sources
are required to give some mass to the unknown object, then
k = 0 for all combinations of mass functions and the conflict
value is precisely the amount of mass assigned to this
unknown object.
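
Dempster's Rule and the conflict value k of equation A.3 can be sketched together, since both come from the same pairwise products. This is a minimal Python sketch, assuming mass functions are dictionaries from frozensets of labels to mass values and that the combined masses are normalized by 1 - k. The function name `combine`, the labels and the mass values are illustrative assumptions.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

# Assumed representation: subset of the frame of discernment -> mass value.
Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Return the orthogonal sum of m1 and m2 together with the conflict k."""
    unnormalized: Mass = {}
    k = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        c = a & b
        if c:
            # Product mass goes to the non-empty intersection C = A ∩ B.
            unnormalized[c] = unnormalized.get(c, 0.0) + mass_a * mass_b
        else:
            # Disjoint pairs contribute to the conflict value (equation A.3).
            k += mass_a * mass_b
    if 1.0 - k < 1e-12:
        raise ValueError("the two mass functions are in total conflict")
    combined = {c: v / (1.0 - k) for c, v in unnormalized.items()}
    return combined, k


# Hypothetical evidence from two knowledge sources.
THETA = frozenset({"road", "grass", "sky"})
m_color = {frozenset({"road"}): 0.6, THETA: 0.4}
m_texture = {frozenset({"grass"}): 0.5, THETA: 0.5}

combined, k = combine(m_color, m_texture)
print(k, combined)
```

Because the rule is commutative, swapping the two arguments gives the same combined masses and the same k, up to floating-point rounding.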

A.3.3 Disbelief versus No Belief




Disbelief and no belief can be discussed both in terms
of a mass function as a whole and in terms of a particular subset
receiving mass within a mass function. Looking at a mass
function as a whole, statements about support and plausibility
are direct statements about disbelief and no belief.
No belief is represented as the support for some subset being
near zero, whereas disbelief is represented as a plausibility
near zero. Looking at the mass values themselves, if for
a given subset A the mass value is near one, then for any
a ∈ A the system provides little information (no belief) for
the support of a, but much information about its disbelief
for any b ∉ A. Disbelief about the system's performance in
general (or about a knowledge source) is contained in the
appropriate conflict value.

Bibliography 

[Binf82] T. Binford, Survey of Model Based Image
Analysis Systems, International Journal of
Robotics Research 1 (1982), pp. 18-64.

[Belk86] R. Belknap, E. Riseman and A. Hanson
(1985), “The Information Fusion Problem and
Rule-Based Hypotheses Applied to Complex
Aggregations of Image Events.” Proceedings:
DARPA Image Understanding Workshop,
December 1985, pp. 279-292.

[Burn86] J. Brian Burns, Allen R. Hanson & Edward M.
Riseman. “Extracting Straight Lines”, IEEE
Trans. on Pattern Analysis and Machine
Intelligence, PAMI-8, #4 (July 1986), pp. 425-455.

[Demp67] A. P. Dempster (1967), “Upper and lower
probabilities induced by a multivalued mapping”,
Annals of Mathematical Statistics, vol.
38, 1967, pp. 325-339.

[Demp68] A. P. Dempster (1968), “A generalization of
Bayesian inference”, Journal of the Royal
Statistical Society, Series B, vol. 30, 1968,
pp. 205-247.

[Demp84] A. P. Dempster and A. Kong (1984), “Belief
functions and communications networks”,
Report, Department of Statistics, Harvard
University, Cambridge, MA, Nov. 1984.


[Faug82] O. D. Faugeras (1982), “Relaxation labeling
and evidence gathering”, Proc. Conf. Pattern
Recognition and Image Processing, Las Vegas,
June 1982, pp. 672-677.

[Hans78] Hanson, Allen R. & Riseman, Edward M.
“VISIONS: A Computer System for Interpreting
Scenes”, in Computer Vision Systems, Hanson
& Riseman, eds., Academic Press, N.Y., 1978,
pp. 303-333.

[Hans86] Hanson, Allen R. & Riseman, Edward M.
(1986), “A methodology for the development
of general knowledge-based vision systems”,
in Vision, Brain, and Cooperative Computation
(M. Arbib and A. Hanson Eds.) to be
published by MIT Press 1987. Also COINS
Technical Report 86-27, University of
Massachusetts at Amherst, July 1986.

[Hans87] Hanson, Allen R. & Riseman, Edward M.,
“The VISIONS Image Understanding System
- 86”, in Advances in Computer Vision, C.
Brown (Ed.), Erlbaum, 1987.

[Humm83] R. A. Hummel and S. W. Zucker (1983), “On
the foundations of relaxation labeling
processes”, IEEE Trans. Patt. Anal. Mach. Intel.,
vol. PAMI-5, no. 3, May 1983, pp. 267-286.

[Kitc80] L. J. Kitchen (1980), “Relaxation applied to
matching quantitative relational structures”,
IEEE Trans. Syst. Man Cybern., vol. SMC-10,
1980, pp. 96-101.

[Kitc84] L. J. Kitchen and A. Rosenfeld (1984), “Scene
analysis using region-based constraint filtering”,
Pattern Recognition, vol. 17, no. 2, 1984,
pp. 189-203.

[Kohl84] Ralf R. Kohler. Integrating Non-Semantic
Knowledge into Image Segmentation Processes.
Ph.D. thesis, COINS, Univ. of Massachusetts,
Amherst, 1984.

[Kybu84] H. E. Kyburg (1984), “Bayesian and non-Bayesian
evidential updating”, Tech. Report
139, Department of Computer Science,
University of Rochester, Rochester NY, July 1984.


[Drap86] Bruce A. Draper, Allen R. Hanson & Edward M.
Riseman (1986), “A Software Environment
for High Level Vision”. Proceedings
AAAI Workshop on Knowledge Based Systems
and Tools, Columbus, Ohio, October 1986.

[Duda73] R. Duda and P. Hart (1973), Pattern Classification
and Scene Analysis. A Wiley-Interscience
Publication, New York, 1973.

[Low82] J. D. Lowrance (1982), “Dependency-graph
models of evidential support”, Ph.D. Thesis,
University of Massachusetts, Amherst, 1982.

[Low83] J. D. Lowrance, T. D. Garvey (1983), “Evidential
reasoning: an implementation for multisensor
integration”, Tech Note 307, SRI Artificial
Intelligence Center, December 1983.




[Lu84] S. Y. Lu, H. E. Stephanou (1984), “A set-theoretic
framework for the processing of uncertain
knowledge”, AAAI-84, pp. 216-221.

[Rey86a] G. Reynolds, D. Strahman and N. Lehrer
(1985), “Converting Feature Values to Evidence”,
Proceedings: DARPA Image Understanding
Workshop, December 1985, pp. 331-339.

[Rey86b] G. Reynolds, D. Strahman, N. Lehrer and
L. Kitchen (1986), “Plausible Reasoning and
the Theory of Evidence.” University of
Massachusetts at Amherst COINS Technical
Report 86-11, April 1986.

[Ros76] A. Rosenfeld, R. A. Hummel and S. W. Zucker
(1976), “Scene labeling by relaxation operations”,
IEEE Trans. Syst. Man Cybern., vol.
SMC-6, 1976, pp. 420-433.

[Shaf73] G. Shafer (1973), “A theory of statistical
evidence”, pp. 365-434 in Foundations of
Probability Theory, Statistical Inference, and
Statistical Theories of Science, W. L. Harper and
C. A. Hooker eds, vol. II.

[Shaf76] G. Shafer (1976), A Mathematical Theory of
Evidence, Princeton University Press, 1976.

[Shaf84] G. Shafer (1984), “Probability judgement in
artificial intelligence and expert systems”.
Working Paper 165, School of Business, The
University of Kansas, Lawrence, Kansas, Dec.
1984.

[Shaf85] G. Shafer (1985), “Belief functions and
possibility measures”, The Analysis of Fuzzy
Information, vol. 1, J. C. Bezdek, ed., CRC Press.

[Shaf83] G. Shafer and A. Tversky (1983), “Weighing
evidence: the design and comparison of
probability thought experiments”, 75th Anniversary
Colloquium Series, Harvard Business School.

[Shor76] Shortliffe, E. H. (1976) Computer-Based Medical
Consultations: MYCIN. New York: American
Elsevier.

[Smit61] C. A. B. Smith (1961), “Consistency in statistical
inference and decision”, Journal of the Royal
Statistical Society, Series B, vol. 23, 1961, pp.
1-37.

[Smit65] C. A. B. Smith (1965), “Personal probability
and statistical analysis”. Journal of the Royal
Statistical Society, Series B, vol. 128, 1965, pp.
469-499.


[Strt84] T. Strat (1984), “Continuous belief functions
for evidential reasoning”, AAAI-84, pp. 308-313.

[Wal72] Waltz, D. (1972) “Generating Semantic
Descriptions from Drawings of Scenes with
Shadows.” AI-TR-271. Project MAC,
Massachusetts Institute of Technology. (Reprinted
in P. Winston (Ed.). (1975) The Psychology
of Computer Vision. McGraw-Hill, New York,
19-92.)

[Weis86] Weiss, R. and M. Boldt. (1986) “Geometric
Grouping Applied to Straight Lines.” Computer
Vision and Pattern Recognition, 1986.

[Wes83] L. Wesley (1983), “Reasoning about control:
the investigation of an evidential approach”,
Proc. Eighth International Joint Conference
on Artificial Intelligence, Karlsruhe, West
Germany, August 1983, pp. 203-210.

[Wes84] L. Wesley (1984), “Reasoning about control:
an evidential approach”, Tech. Report 324,
SRI Artificial Intelligence Center, 1984.

[Wes85] L. Wesley (1985), Ph.D. Thesis, University of
Massachusetts, Amherst, in preparation.

[Weym86] Terry E. Weymouth. Using Object Descriptions
in a Schema Network for Machine Vision,
Technical Report 86-24, COINS dept.,
Univ. of Massachusetts, May 1986.

[Wood78] Woods, W. A. “Theory Formation and Control
in a Speech Understanding System with
Extrapolations Toward Vision”, in Computer
Vision Systems, Hanson and Riseman, eds.,
Academic Press, N.Y., 1978, pp. 379-390.

[Zad83] L. A. Zadeh (1983), “The role of fuzzy logic in
the management of uncertainty in expert systems”,
Fuzzy Sets and Systems, vol. 11, 1983,
pp. 199-227.




GOAL-DIRECTED  CONTROL  OF  LOW-LEVEL 
PROCESSES  FOR  IMAGE  INTERPRETATION 


Charles  A.  Kohl  Allen  R.  Hanson 

Edward  M.  Riseman 

Department  of  Computer  and  Information  Science 
University  of  Massachusetts 
Amherst, Massachusetts 01003


ABSTRACT 

The  task  of  image  interpretation  is  viewed  as  a  coordinated 
process  in  which  high-level  interpretation  processes  and  low- 
level  segmentation  processes  interact  through  what  is  known 
as  the  intermediate-level  of  processing.  Control  processes  at 
this  intermediate-level  respond  to  requests  from  the  interpre¬ 
tation  processes  which  are  expressed  in  terms  of  goals.  The 
presentation  of  a  request  for  low-level  processing,  which  is  rep¬ 
resented  by  the  posting  of  a  goal,  results  in  the  creation  of  an 
intermediate-level  control  process  which  is  an  instantiation  of 
a control structure known as a schema. The set of schemas
represented at the intermediate level defines the types of goals
which may be processed at this level.

The  instantiations  of  the  schemas  utilize  knowledge  of  the 
image  domain  to  translate  the  goals  into  appropriate  low  level 
process  specifications  which  include  the  identification  of  the 
image  features,  algorithms,  sensitivity  settings,  or  rules  to  be 
used  by  the  process  to  accomplish  the  desired  task.  These 
low-level  process  specifications  are  then  passed  to  a  low  level 
process  controller  which  directs  the  execution  of  the  process 
and  then  returns  the  results  to  the  intermediate  level  control 
process. 

Using this control paradigm, high level interpretation processes
gain the capability to create or refine segmentation data
according to predefined goals and/or current hypotheses
regarding the content of the image, and thereby improve the
quality of the overall interpretation.

1.  THE  INTERMEDIATE  LEVEL  OF  IMAGE 
INTERPRETATION 

Image interpretation is a complex process through which
a numeric array, representing a digitized visual scene, can be
analyzed to provide a semantic description of the scene content.
One subgoal of the interpretation process is image segmentation,
the low-level process by which the digitized image is
abstracted into a set of primitive elements which may be used
by interpretation processes as the basis for the construction
of an abstract symbolic model of the original scene. According
to many commonly accepted views of interpretation, image
segmentation is simply the first stage of the interpretation
process. We take the view, however, that segmentation is a process
which does not exist in isolation, but rather is an integral part
of the overall image interpretation process.

*This work was supported in part by the Air Force Office of Scientific
Research Grant AFOSR-86-0021, the National Science Foundation Grant
DCR-8318776, and DARPA Contract DACA76-85-C-0008.

This concept of the interdependence between the segmentation
and interpretation processes has developed slowly over the
years as researchers have become aware of the difficulty of each
of the phases of processing. Researchers such as Marr (j'J f)|),
Brooks and Binford ([3]), Thompson (|3tl), Reynolds (|2fi|),
Hwang (jlfij) and Hanson and Riseman (Jtitlj) have all stressed
the need for interaction between high and low level processes.

To date, however, little has been done to address this problem,
and thus there are no commonly accepted protocols for
the interaction. If we are to adequately deal with interaction
between high and low level processes, there are three separate
issues which must be resolved. First, there must be a common
knowledge representation which is applicable to all levels of the
interpretation process; intercommunication between processes
at different levels is difficult or impossible if data objects and
hypotheses are not consistent across the (somewhat artificial)
boundaries between levels. Second, there is the necessity for
a single unified control mechanism which can coordinate and
schedule the activities of high and low level processing tasks.
Finally, there is the requirement for an explicit representation
of the knowledge necessary for the effective application of low-level
processes (i.e. the selection of the appropriate algorithms,
sensitivity settings, and image features). For example, we require
a mechanism which can encode the knowledge which indicates
that certain image features and algorithms are generally
useful when segmenting textured image areas, while others are
more useful when segmenting smooth image areas.
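
One way to picture the third requirement is a small lookup table from the kind of image area to a low-level process specification. The following is a minimal Python sketch under stated assumptions: the names `ProcessSpec`, `KNOWLEDGE` and `select_process` are hypothetical, the sensitivity numbers are invented placeholders, and the pairing of area type with algorithm only mirrors the examples discussed in Section 2. It is not the representation used in GOLDIE.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProcessSpec:
    """Hypothetical low-level process specification."""
    algorithm: str
    features: Tuple[str, ...]
    sensitivity: float  # placeholder value, not taken from the paper


# Assumed knowledge table: which process suits which kind of image area.
KNOWLEDGE: Dict[str, ProcessSpec] = {
    "textured": ProcessSpec("histogram_clustering", ("intensity",), 0.3),
    "smooth": ProcessSpec("zero_crossing", ("intensity",), 0.8),
}


def select_process(area_type: str) -> ProcessSpec:
    """Look up the process specification recorded for an area type."""
    try:
        return KNOWLEDGE[area_type]
    except KeyError:
        raise ValueError(f"no knowledge recorded for area type {area_type!r}")


print(select_process("textured"))
```

Keeping this knowledge in an explicit table, rather than inside each algorithm, is what allows a control process to inspect and change the choice.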

In this report, we outline the design of a system which has
been developed within the VISIONS Image Understanding System
([17]) to deal with these three issues and accomplish the
integration of the high and low levels of the image understanding
process. The system is designed to provide the
mechanism by which high-level requests for data may activate
a set of appropriate low-level processes which may be capable
of producing the desired data. The specification of the data
request is in the form of a goal, a data structure which defines the
nature of the image abstractions desired by the requesting process.
Attributes stored within the goal data structure, known
as constraints, express additional information which may be
useful in the specification of the most appropriate low-level
processes. The goal constraints express the desired characteristics
of the data to be produced, and may be represented in
either semantic or syntactic terms. The semantic constraints
are defined in terms of semantic labels (e.g. “segment a specific
portion of the image to separate tree and sky”), while the syntactic
constraints are expressed in terms of measurable image
features (e.g. “produce regions which exhibit homogeneous texture
measures”). Through the use of the information expressed
in these constraints, the system is able to select the most appropriate
image features, algorithms, sensitivity settings, rules,
etc., for the specification of the low-level task.
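
The goal and its two kinds of constraints can be pictured as a small record type. This is a minimal Python sketch, assuming hypothetical class and field names (`Goal`, `SemanticConstraint`, `SyntacticConstraint`, `token_type`, `area`) and an invented image window; the actual goal data structure is not specified in this section.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class SemanticConstraint:
    """Constraint stated in terms of semantic labels."""
    labels: Tuple[str, ...]


@dataclass(frozen=True)
class SyntacticConstraint:
    """Constraint stated in terms of a measurable image feature."""
    feature: str
    requirement: str


@dataclass
class Goal:
    """Hypothetical request for image abstractions over part of the image."""
    token_type: str                  # e.g. "region" or "line"
    area: Tuple[int, int, int, int]  # assumed (row0, col0, row1, col1) window
    semantic: List[SemanticConstraint] = field(default_factory=list)
    syntactic: List[SyntacticConstraint] = field(default_factory=list)

    def is_constrained(self) -> bool:
        """True if the goal carries any constraint at all."""
        return bool(self.semantic or self.syntactic)


# "Segment a specific portion of the image to separate tree and sky."
separate_tree_and_sky = Goal(
    token_type="region",
    area=(0, 0, 128, 256),
    semantic=[SemanticConstraint(("tree", "sky"))],
)

# "Produce regions which exhibit homogeneous texture measures."
homogeneous_texture = Goal(
    token_type="region",
    area=(0, 0, 256, 256),
    syntactic=[SyntacticConstraint("texture", "homogeneous")],
)
print(separate_tree_and_sky, homogeneous_texture, sep="\n")
```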

In VISIONS, the interface between the high and low levels
of the interpretation process is called the intermediate level;
data at this level is represented by tokens. Tokens may represent
any abstraction of the raw image data which is in spatial
registration with the actual data and/or groups of these
abstractions which have been formed by the application of grouping
operations ([35,2,28]). Thus, intermediate level tokens may
represent any of the normal types of segmentation output, such
as regions, lines, or surfaces. The intermediate level is distinguished
from the low level in that there is no explicit representation
of pixels, and from the high level by the requirement
that tokens are in spatial registration with the image data.

The GOLDIE (Goal-Directed Intermediate-Level Executive)
system ([16]) has been developed as a mechanism to provide
this intermediate-level control of low-level processing. The
basic control structure of GOLDIE is the schema, a declarative
specification of control strategies which may be used by
the system to satisfy a specific goal. Schemas provide a flexible
and extensible control structure which is used within VISIONS
to direct both the high and intermediate processing levels
([36,27,11]), and they provide a natural interface between
processes at each of these levels. GOLDIE extends this concept
of schema-directed control to the low-level processes.

2. THE NEED FOR GOAL-DIRECTED CONTROL

Low level segmentation processing is in many ways more of
an art than a science. Despite an enormous number of excellent
techniques and algorithms (e.g. see [31,14,1,8,10]), no general
methods have been found which perform adequately across
a wide variety of images. Methods such as region-growing
([12]), histogram cluster labeling ([22]), thresholding ([18]),
zero-crossings ([13,20]), and edge and line detection ([5]) have
all been found to be effective within restricted domains, but
rarely demonstrate the robustness necessary for generalized image
interpretation.

The failure of general segmentation techniques to provide
“good” segmentations can be traced to a variety of causes. In
many cases the image data itself is complex because of the
physical situation from which the image was derived and/or
the nature of the scene. Heavy textures, unconstrained lighting,
view angle, surface reflectivities, markings, and curvature
all conspire to produce ambiguous image data. This results
in under- or over-merged regions in the region segmentations
and missing, fragmented, or incorrectly merged lines in the
edge/line segmentations. The type of error often depends upon
which feature(s) is being used to produce the segmentation and
whether the algorithm has a global or local view of the image
data during processing. The type of error produced quite often
varies spatially within a single image, depending on the type
of data in each locality of the image.

Figure 1: Intensity Image of Outdoor Scene

To  take  a  specific  example,  the  intensity  surface  of  an  image 
(Figure  1)  may  contain  a  great  deal  of  high  frequency  informa¬ 
tion  which  typically  will  lead  to  significant  fragmentation  in 
the  segmentation  such  as  shown  in  Figure  2.  Here  we  see 
that  the  region  segmentation,  which  was  produced  through  a 
zero-crossing algorithm ([20]), has produced a large number of
small  regions  across  the  textured  projection  of  the  tree.  While 
these  regions  do  represent  image  areas  of  local  homogeneity, 
in  most  cases  they  are  of  little  semantic  interest.  If  our  over¬ 
all  interpretation  goal  is  to  identify  the  gross  objects  in  the 
scene,  a  segmentation  such  as  that  shown  in  Figure  3,  which  is 
produced  through  a  one-dimensional  histogram  clustering  al¬ 
gorithm,  would  be  much  more  useful  in  the  identification  of  the 
trees  in  the  image. 

In the case of the smoothly varying or flat surfaces of this
image,  however,  the  opposite  is  true.  Here,  the  slow  gradients 
of  the  intensity  surface  result  in  missing  object  boundaries  in  the 
segmentation  produced  by  the  clustering  algorithm,  resulting 
in  the  class  of  segmentation  error  known  as  over-merging.  An 
example  of  this  class  of  error  is  the  lack  of  a  boundary  between 
the  sky  and  house  wall  in  Figure  3. 

Existing image interpretation systems attempt to compensate
for these two types of error within the high-level interpretation
process. Undermerging errors are typically corrected
with grouping processes which are used to merge adjacent regions
with similar labels ([37,33]). Overmerging errors are
typically bypassed implicitly through the use of a segmentation
which has such a fine resolution that it can be assumed
that any object boundary which does exist in the image has a
corresponding region boundary in the image ([9,19,25]).

However,  since  both  types  of  errors  may  occur  simultane¬ 
ously  in  different  areas  of  any  given  segmentation,  it  is  our 
assertion  that  error  correction  properly  belongs  within  the  do¬ 
main  of  the  low-level  segmentation  processes  themselves.  As 
an  interpretation  process  discovers  problems  with  a  segmen¬ 
tation,  the  segmentation  itself  should  be  redefined  to  resolve 
those  problems.  This  redefinition  of  the  segmentation  may  re¬ 
quire  the  application  of  a  different  segmentation  algorithm,  or 
of  the  same  algorithm  with  different  sensitivity  settings,  over 
the  portion  of  the  image  which  presents  the  difficulties  for  the 
interpretation  process. 


Another  factor  leading  to  variability  in  low  level  segmen¬ 
tation  processing  has  to  do  with  the  choice  of  image  features 
over  which  the  segmentation  processes  will  run.  In  typical  RGB 
images,  a  large  number  of  computed  image  features  can  be  de¬ 
fined,  some  of  which  are  known  to  be  effective  in  segmentation 
processing ([24,25,32,21]). However, in a large number of segmentation
algorithms, only Intensity (the average of R, G, and
B) is used. In other algorithms, the selection of features is
typically made in an ad hoc fashion, where the general experience
of the person designing the segmentation algorithm is
brought  into  play.  If,  however,  such  knowledge  and  experience 
were  represented  explicitly  in  the  system,  it  would  be  possi¬ 
ble  to  perform  this  selection  automatically  in  response  to  the 
overall  goals  of  the  interpretation  process. 

Finally,  there  is  the  problem  of  evaluating  different  region/line 
segmentations.  Despite  some  efforts  to  quantify  the  evaluation 
of segmentation processing ([23,26]), it has been our experience
that no low-level evaluation measure restricted to making
measurements  on  the  segmentation  can  provide  a  useful  com¬ 
parative  metric.  Rather,  the  quality  of  a  segmentation  can  be 
measured  only  with  respect  to  the  goals  of  an  interpretation 
process  which  is  applied  to  the  segmentation  data.  Different 
criteria  must  be  used  to  evaluate  the  quality  of  segmentation 
results  on  highly  textured  image  areas  than  on  smoothly  vary¬ 
ing  image  areas. 

In  view  of  this  variability  in  the  selection  of  different  seg¬ 
mentation  methodologies  and  the  difficulties  of  evaluating  the 
results  of  various  techniques,  it  becomes  apparent  that  some 
form  of  intelligent  intermediate-level  control  is  necessary.  We 
believe  that  the  most  effective  form  for  this  control  is  a  mecha¬ 
nism  which  can  select  and  apply  the  most  appropriate  segmen¬ 
tation  process  for  each  distinct  area  of  the  image.  Through  the 
use  of  an  initial  region  segmentation  of  the  image,  these  distinct 
image  areas  may  be  coarsely  distinguished,  and  then  through 
an  iterative  process  of  resegmentation  and  merging  (using  ap¬ 
propriate  algorithms  and  rules),  the  areas  may  be  more  pre¬ 
cisely  identified.  In  other  words,  our  methodology  is  to  form 
a  hierarchical  set  of  segmentation  plans  in  which  the  results  of 
early  processing  are  combined  with  the  intermediate  results  of 
the  interpretation  process  and  used  to  guide  and  constrain  the 
later  processing. 

The  knowledge  required  under  this  control  paradigm  pro¬ 
vides  the  rules  and  information  by  which  the  goal  constraints 
may  be  used  to  determine  the  utility  of  the  various  low-level 
processes available to the system. In the specification of these
low-level processes, the control process is able to choose among
a  wide  array  of  possible  intermediate  and  low-level  tools  which 
include: 

•  a set of parameterized segmentation algorithms,

•  a set of low-level image analysis and enhancement routines,

•  a set of measurable image features,

•  processes for accessing and modifying the tokens and their
associated attributes in the intermediate-level database,
and

•  a formalized specification of segmentation goals.


Figure 4: Overview of the GOLDIE System


In  the  next  section,  we  demonstrate  the  mechanisms  by 
which  GOLDIE  represents  knowledge  of  these  processes  and 
tools, and how it uses the knowledge to control low-level processing.

3.  INTERMEDIATE-LEVEL  KNOWLEDGE  AND 
CONTROL 

The  GOLDIE  system,  outlined  in  Figure  4,  may  be  de¬ 
scribed  in  terms  of  five  major  functional  components:  the  low- 
level  process  controller,  the  data  structures  for  intermediate- 
level  tokens  and  hypotheses  (ISTM),  the  representation  of  ex¬ 
plicit  intermediate-level  knowledge  (ILTM),  the  representation 
of  the  control  state  of  the  system  (Goal  Blackboard  and  Schema 
Instantiations),  and  the  actual  control  process  which  manages 
the  system  (represented  in  the  figure  by  the  arcs  between  the 
other  modules).  In  this  section,  we  will  present  a  brief  descrip¬ 
tion  of  each  of  these  components,  and  then  demonstrate  the 
mechanisms  by  which  the  components  are  combined  to  pro¬ 
duce  a  working  system. 

All of the actual image processing that takes place within
this system is managed by the low-level process controller. The
controller maintains a consistent protocol for the activation of
low-level processing tasks, permitting the schema representation
of low-level processing to be expressed in a manner independent
of the actual procedural implementation of these tasks.
Thus  the  addition,  deletion,  or  modification  of  low-level  pro¬ 
cessing  tasks  may  be  accomplished  with  minimal  modification 
of  the  schemas  which  control  the  tasks.  Since  many  low-level 
tasks produce intermediate-level data in the form of tokens,
the controller is also responsible for the maintenance of a data
structure  which  encodes  the  relations  between  tokens  and  the 
processes  by  which  they  were  created.  This  data  structure  pro¬ 
vides  the  data  necessary  for  backtracking  if  the  schemas  decide 
at  any  point  that  specific  segmentation  processing  should  be 
“undone”. Thus if several different attempts were made to segment
an area of the image, but only one of these was found to
be  acceptable,  this  mechanism  allows  the  deletion  of  the  tokens 
produced  by  the  unacceptable  segmentation  processes. 
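
The bookkeeping described above can be illustrated with a minimal sketch. The paper does not give the controller's actual data structure, so the `ProcessLog` class, its method names, and the process and token identifiers below are all hypothetical.

```python
from collections import defaultdict


class ProcessLog:
    """Records which low-level process created which tokens (hypothetical)."""

    def __init__(self):
        self.tokens_by_process = defaultdict(list)  # process id -> token names
        self.live_tokens = set()                    # tokens currently in the database

    def record(self, process_id, token_names):
        """Note that `process_id` produced the given tokens."""
        for name in token_names:
            self.tokens_by_process[process_id].append(name)
            self.live_tokens.add(name)

    def undo(self, process_id):
        """Delete every token produced by one rejected process."""
        removed = self.tokens_by_process.pop(process_id, [])
        self.live_tokens.difference_update(removed)
        return removed


log = ProcessLog()
log.record("seg-attempt-1", ["R1", "R2"])
log.record("seg-attempt-2", ["R3", "R4", "R5"])
log.undo("seg-attempt-1")  # attempt 1 was judged unacceptable
remaining = sorted(log.live_tokens)
```

Only the tokens of the surviving attempt remain, which is the effect the backtracking mechanism is meant to provide.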

The  dynamic  data  structures  of  the  GOLDIE  system  are  the 
sets  of  tokens  and  hypotheses  about  tokens  which  are  stored 
in ISTM (Intermediate-level Short Term Memory). The tokens
themselves are defined through the ISR (Intermediate
Symbolic  Representation),  a  combination  database  and  seman¬ 
tic  network  in  which  tokens  are  defined  as  graph  nodes  with 
associated  bitmaps  which  represent  the  spatial  extent  of  the 
token  in  the  image.  Attributes  of  the  tokens,  such  as  their 
compactness  measure  or  mean  feature  value,  are  computed  on 
demand  by  the  ISR  and  then  stored  in  the  database  for  effi¬ 
cient  retrieval  by  any  other  process  which  may  require  the  data. 
Relations  between  tokens,  such  as  boundary  contrast  between 
two  adjacent  region  tokens  for  a  given  image  feature,  are  rep¬ 
resented  as  arcs  which  have  values  indicating  the  nature  of  the 
relation.  Tokens  may  be  directly  accessed  by  name  or  associa- 
tively  accessed  through  the  specification  of  constraints  on  their 
attributes. 
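
A small sketch may make the two access paths and the compute-on-demand behavior concrete. It is not the ISR itself: the `Token` and `TokenStore` classes, the pixel-set stand-in for a bitmap, and the `area` attribute are assumptions chosen for illustration.

```python
class Token:
    """A region token: a name plus the set of pixels it covers (hypothetical)."""

    def __init__(self, name, pixels):
        self.name = name
        self.pixels = set(pixels)  # stand-in for the token bitmap
        self._cache = {}           # attribute name -> stored value

    def attribute(self, attr, compute):
        """Compute an attribute on first request, then reuse the stored value."""
        if attr not in self._cache:
            self._cache[attr] = compute(self)
        return self._cache[attr]


def area(token):
    """Example attribute: number of pixels covered by the token."""
    return len(token.pixels)


class TokenStore:
    def __init__(self):
        self.tokens = {}

    def add(self, token):
        self.tokens[token.name] = token

    def by_name(self, name):
        """Direct access by token name."""
        return self.tokens[name]

    def where(self, attr, compute, predicate):
        """Associative access: names of tokens whose attribute satisfies predicate."""
        return sorted(t.name for t in self.tokens.values()
                      if predicate(t.attribute(attr, compute)))


store = TokenStore()
store.add(Token("R1", [(0, 0), (0, 1), (1, 0)]))
store.add(Token("R2", [(5, 5)]))
large = store.where("area", area, lambda a: a >= 2)
```

After the query, each token holds its cached `area`, so a later request for the same attribute does not recompute it.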

ILTM (Intermediate-level Long Term Memory) encodes
the image-independent intermediate-level knowledge structures
of GOLDIE, which include schemas, object domain knowledge,
and general knowledge of image syntax. These structures form
the core of the system in that they contain the knowledge which
permits the intermediate-level schemas to utilize low-level processes
to achieve high-level goals without explicit high-level
control.




As was stated earlier, the schemas of the ILTM are the
modular representation of the set of strategies which may be
used in an attempt to satisfy the processing requests we call
goals. Each schema is designed to serve one particular type of
goal, and thus there is a one-to-one correspondence between
the  goals  that  can  be  recognized  by  the  GOLDIE  system  and 
the  schemas  of  the  ILTM.  The  strategies  of  the  schemas  specify 
the  various  low  or  intermediate-level  tasks  through  which  the 
goal  may  potentially  be  satisfied.  The  execution  of  the  strate¬ 
gies  will  typically  involve  a  sequence  of  operations  which  may 
include  the  posting  of  subgoals,  the  execution  of  image  pro¬ 
cessing  tasks  through  the  low-level  process  controller,  and  the 
construction  of  hypotheses  based  on  data  represented  in  the 
ISTM.  Message  passing  constructs  allow  the  executing  strate¬ 
gies  to  communicate  and  synchronize  their  activities. 

Another  knowledge  structure  of  the  ILTM  is  the  object  do¬ 
main  knowledge  which  encodes  information  about  the  appear¬ 
ance  of  objects  in  an  image  which  can  be  useful  in  the  specifi¬ 
cation  of  segmentation  tasks.  This  knowledge  is  represented  as 
a  hierarchical  set  of  graph  nodes  corresponding  to  specific  se¬ 
mantic  objects  or  classes.  Associated  with  each  node  is  the  set 
of  features  which  have  been  experimentally  shown  to  best  dis¬ 
criminate  that  object  from  all  other  objects  in  a  training  set  of 
images. Each of these nodes also contains heuristic information
in the form of a syntactic (i.e., feature-based) characterization
of the object (e.g., textured, smooth, gradient), the segmentation
resolution at which the object may be best extracted (high,
medium, or low), and a set of rules which can be used by the
intermediate-level system to provide crude hypotheses about
the possible semantic identity of a particular region token.

ILTM also contains the system’s knowledge of image syntax.
This knowledge encodes the heuristic information which
permits the evaluation of a token, without any use of semantic
information, to create a hypothesis as to whether further processing
is required. This knowledge includes the set of types
of regions (textured, smooth, gradient) or lines (low-contrast,
high-contrast, horizontal, long, etc.), and the set of rules to be
used to syntactically evaluate each of these types of token. The
results of the application of these rules are used to create syntactic
hypotheses about the tokens which indicate the degree
to which the image feature data supports the existence of the
token (e.g., a set of rules which are to be used to determine
whether a region token which is categorized as textured should
be resegmented).


Region-Segmentation
'((region-token Region-CHXXM2)
(objective . object-hypothesis-conflict)
(evaluation-constraint . nil)
(object-labels . (tree grass)))


Figure 5: Goal Node Representation


Syntactic  knowledge  also  has  to  do  with  the  utility  of  spe¬ 
cific  image  features  and  algorithms  in  the  creation  of  tokens 
with  specific  characteristics.  For  example,  knowledge  struc¬ 
tures  in  the  ILTM  encode  experimentally  derived  information 
that  indicates  that  certain  features  and  algorithms  are  gener¬ 
ally  useful  when  segmenting  a  textured  region,  while  others  are 
useful  when  segmenting  smooth  or  gradient  regions. 

The  actual  control  of  processing  in  GOLDIE  is  managed 
through  another  semantic  network  which  represents  the  con¬ 
trol  state  of  the  system.  Goals  and  schema  instantiations  are 
represented  as  nodes  of  this  network,  and  arcs  represent  the 
relations  between  these  entities  ([36,7,6]).  The  goal  blackboard 
is  a  section  of  the  network  in  which  goal  nodes,  representing 
requests  for  intermediate-level  processing,  are  created  by  the 
schema  instance  that  requires  the  results  of  that  processing. 
Any  constraints  on  the  request  are  expressed  as  attributes  of 
the  goal  node.  By  using  an  attribute-value  list  to  express  these 
constraints  (Figure  5),  the  schema  instantiation  which  is  post¬ 
ing  the  goal  can  express  any  information  which  may  possibly 
be  of  use  to  the  responding  schema  instance.  The  schema  in¬ 
stance  which  responds  to  a  goal  will  extract  the  value  of  any 
named  attribute  which  it  can  utilize  in  processing. 

The control process of the GOLDIE system continually monitors
this goal blackboard for the existence of new goal nodes.
As a new goal is observed, the control process creates a new instance
of the corresponding schema. The node for this schema
instance is linked to the goal node by an arc which represents
a contract by the instance to attempt to satisfy the goal. The
communication between invoking and invoked schema instances
is achieved through a mailbox on the goal node. When the invoked
schema instance has obtained results which may satisfy
the goal, these results are placed in the mailbox of the goal.
The invoking schema process is then able to examine the results
and either accept them as satisfying the goal, or request
that additional strategies be used in an attempt to produce
more acceptable data.
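
The goal-posting protocol can be sketched in a few lines. This is an illustration under assumptions, not GOLDIE's implementation: the `GoalNode` class, the `SCHEMAS` lookup table, the `control_step` function, the toy schema body, and the region name `Region-1` are all invented here; only the attribute names follow Figure 5.

```python
class GoalNode:
    """A request for intermediate-level processing (hypothetical layout)."""

    def __init__(self, goal_type, constraints):
        self.goal_type = goal_type
        self.constraints = dict(constraints)  # attribute-value list
        self.mailbox = []                     # results left by the invoked instance
        self.contractor = None                # schema instance serving this goal


def region_segmentation_schema(goal):
    """Toy schema: pretend to split the named region into two tokens."""
    region = goal.constraints.get("region-token")
    goal.mailbox.append([f"{region}-a", f"{region}-b"])


# One schema per goal type, mirroring the one-to-one correspondence in the text.
SCHEMAS = {"Region-Segmentation": region_segmentation_schema}


def control_step(blackboard):
    """Instantiate the matching schema for every goal not yet under contract."""
    for goal in blackboard:
        if goal.contractor is None:
            goal.contractor = SCHEMAS[goal.goal_type]
            goal.contractor(goal)


blackboard = [GoalNode("Region-Segmentation", {
    "region-token": "Region-1",
    "objective": "object-hypothesis-conflict",
    "evaluation-constraint": None,
    "object-labels": ("tree", "grass"),
})]
control_step(blackboard)
results = blackboard[0].mailbox[0]
```

The invoking side would then read `results` from the mailbox and either accept them or ask for further strategies.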

As a way of minimizing the number of unacceptable results,
many schemas are capable of using a specific type of
constraint known as the evaluation constraint. This constraint
is expressed as a function-value pair which is used to evaluate
results prior to their being returned to the invoking schema
instance. If the value returned from the application of the function
to a potential set of results does not exceed the threshold
value expressed in the constraint, the invoked schema instance
will automatically continue processing until a set of results is
found which can meet this constraint. This mechanism makes it
possible to tune the schema processing to highly specific goals.
For example, an evaluation constraint could be created such
that the region segmentation schema would return only segmentations
which contained regions of a certain shape or color,
or both.
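
As a rough illustration of the function-value pair, the sketch below treats each strategy as a callable and keeps trying strategies until one result scores above the threshold. The function name `satisfy`, the region dictionaries, and the color-based scoring function are assumptions made for the example.

```python
def satisfy(strategies, evaluation_constraint):
    """Try strategies in order until one result scores above the threshold.

    `evaluation_constraint` is a (function, threshold) pair. Returns the
    accepted result, or None when every strategy has been tried.
    """
    score, threshold = evaluation_constraint
    for strategy in strategies:
        result = strategy()
        if score(result) > threshold:
            return result
    return None


def coarse():
    """Toy strategy producing two regions."""
    return [{"color": "green"}, {"color": "gray"}]


def fine():
    """Toy strategy producing three regions."""
    return [{"color": "green"}, {"color": "green"}, {"color": "gray"}]


def green_fraction(regions):
    """Fraction of regions labeled green (an invented color criterion)."""
    return sum(r["color"] == "green" for r in regions) / len(regions)


accepted = satisfy([coarse, fine], (green_fraction, 0.6))
```

Here the coarse result scores 0.5 and is passed over, while the fine result scores 2/3 and is returned.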

A  typical  scenario  illustrating  the  interaction  between  GOLDIE 
and  the  high-level  interpretation  system  could  involve  a  situa¬ 
tion  where  a  high-level  interpretation  process  had  formed  two 
conflicting  hypotheses  regarding  the  semantic  label  to  be  as¬ 
signed to a particular region token. Under the assumption
that the region token actually represented an area that con¬
tained  two  different  objects,  the  interpretation  schema  instance 
would  post  a  goal  (Figure  5)  for  region  segmentation.  The  con¬ 
straints  on  the  goal  would  indicate  the  region  token  of  interest, 
and  would  additionally  indicate  that  the  reason  for  the  goal 
posting  was  that  there  was  an  object  hypothesis  conflict  for 
the  two  specified  object  labels. 

In  response  to  this  goal  posting,  the  system  would  activate 
an  instance  of  the  region  segmentation  schema.  This  schema 
instance  would  select  image  features  and  algorithms  appropri¬ 
ate  for  the  discrimination  of  the  two  objects  and  then  initiate  a 
segmentation  process  which  would  hopefully  split  the  original 
region  into  two  new  regions,  each  representing  one  of  the  two 
objects.  Tokens  corresponding  to  these  two  regions  would  then 
be  returned  to  the  interpretation  schema  instance  for  evalua¬ 
tion.  If  the  interpretation  schema  instance  were  satisfied  with 
the  results,  the  goal  and  region  segmentation  schema  instance 
would  be  deleted.  Otherwise,  the  schema  instance  would  con¬ 
tinue  processing  until  an  acceptable  segmentation  were  found 
or  until  the  strategies  of  the  region  segmentation  schema  in¬ 
stance  were  exhausted. 


4.  INTERMEDIATE-LEVEL  SCHEMAS 

The set of intermediate-level schemas which have been implemented
in the GOLDIE system is shown in Table 1 and
Figure 6. Although the figure implies a hierarchy of schemas
in  the  system,  this  hierarchy  is  designed  only  to  demonstrate 
the  relationship  between  the  initialization  schema  and  the  other 
schemas  of  the  system.  We  will  use  this  hierarchy  as  the  basis 
for  our  discussion,  although  it  is  important  to  recognize  that 
the  actual  interaction  between  the  schemas  is  more  complex 
than  indicated  by  Figure  6. 

4.1  The  Initialization  schema 

At  the  highest  level  of  the  hierarchy  is  the  initialization 
schema.  This  schema  is  typically  intended  to  be  invoked  at 
system  startup  to  provide  the  initial  set  of  image  tokens  for  the 
interpretation  process.  Since  the  interpretation  schemas  of  the 
high-level  system  are  designed  to  operate  across  sets  of  tokens 
representing  image  abstractions,  no  interpretation  processing 
may  take  place  until  we  have  an  initial  segmentation  produced 
by  the  initialization  schema. 


Figure 6: Schemas of the GOLDIE System




|  Schemas of the GOLDIE System  |

Schema Name                 | Action
----------------------------+------------------------------------------------
initialization              | Obtain the “best” segmentation of the specified
                            | region token (possibly the whole image) with
                            | respect to (possibly initial) specified
                            | constraints
region segmentation         | Segment a region token
region-algorithm            | Select an algorithm for region segmentation
region-feature              | Select an image feature for region segmentation
region merge                | Merge adjacent region tokens which meet criteria
                            | selected as a result of the constraints
syntactic region evaluation | Evaluate region tokens for resegmentation or
                            | merging with respect to specified constraints
semantic region evaluation  | Provide crude hypotheses for the semantic
                            | identity of the specified region tokens
line segmentation           | Segment region tokens using evidence from a
                            | (set of) line token(s)
line extraction             | Extract line tokens from image data
line-algorithm              | Select an algorithm for line extraction
line-feature                | Select an image feature for line extraction
collinear line merge        | Merge adjacent line tokens which meet the
                            | criteria selected as a result of the constraints

Table 1: Intermediate-level Schema Descriptions


The  intent  of  this  schema  is  to  produce  the  “best  possible” 
segmentation  of  a  region  token  without  any  assistance  from  in¬ 
terpretation  processes.  If  used  to  provide  the  initial  segmenta¬ 
tion  data,  the  only  constraints  on  the  satisfaction  of  the  initial¬ 
ization  goal  are  the  a  priori  constraints  of  the  overall  system. 
For  example,  if  we  were  implementing  a  system  whose  primary 
goal  was  to  identify  different  types  of  trees  in  outdoor  scenes, 
the  a  priori  constraints  on  the  goal  for  initialization  would  be 
“emphasize regions with texture characteristics” and “emphasize
tree objects”. In this way, the initial segmentation which
is  presented  to  the  interpretation  system  has  already  been  at 
least  partially  tuned  to  the  overall  goals  of  the  interpretation 
process. 

The  initialization  schema  directly  or  indirectly  makes  use 
of  all  of  the  other  intermediate-level  schemas  of  the  system  in 
this  attempt  to  provide  the  “best”  data-directed  segmentation, 
and  is  the  most  complex  schema  present  in  the  GOLDIE  sys¬ 
tem.  Although  the  schema  contains  only  a  single  strategy,  the 
execution  of  this  strategy  involves  region  segmentation,  region 
merging,  line  segmentation,  region  evaluation,  and  recursive 
invocation  to  achieve  the  desired  result. 

The initialization schema requires a region token as input
to define the area over which the segmentation is to take place.
In the case where the schema is being used to initialize the
system, the bitmap for this region token is defined as the entire
image. The first step of the strategy is the posting of a
goal for region segmentation on this region token. Unless over¬
ridden  by  its  own  goal  constraints,  the  initialization  schema 
instance  will  specify  an  evaluation  constraint  on  the  region 
segmentation  goal  which  makes  use  of  the  syntactic  evalua¬ 
tion  hypotheses  created  by  an  instance  of  the  syntactic  region 
evaluation schema. The function associated with this evaluation
constraint requires that the majority of the region tokens
produced are hypothesized to be “good”. In this context,
“goodness” would indicate that no further segmentation was
necessary, and thus the function computes a value based in
part on the resegmentation hypotheses associated with each of
the region tokens in the segmentation. Obviously, if this were
the only factor, the best segmentation would be one in which
each pixel were represented as an individual region token, and
thus other factors such as the number of region tokens are also
used in this function. The specification of this constraint is
an  attempt  to  balance  the  cost  of  further  segmentation  against 
the  cost  of  region  merging  to  produce  the  highest  quality  seg¬ 
mentation  at  the  lowest  computational  cost.  Figure  7  shows 
the  segmentation  selected  by  the  region  segmentation  schema 
according  to  this  evaluation  constraint  for  the  image  from  Fig¬ 
ure  1. 
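
The paper does not give the form of this evaluation function. The sketch below is one assumed form that shows the trade-off: it multiplies the fraction of regions with weak resegmentation hypotheses by a penalty on the region count. The function name, the 0.5 cutoff for a "good" region, and the linear penalty are all hypothetical.

```python
def segmentation_score(reseg_strengths, max_regions, good_below=0.5):
    """Score a segmentation from per-region resegmentation strengths in [0, 1].

    The first factor is the fraction of regions hypothesized to be "good"
    (weak resegmentation hypothesis). The second factor penalizes the region
    count so that one-region-per-pixel does not win. Both forms are assumptions.
    """
    n = len(reseg_strengths)
    if n == 0:
        return 0.0
    good_fraction = sum(s < good_below for s in reseg_strengths) / n
    count_penalty = max(0.0, 1.0 - n / max_regions)
    return good_fraction * count_penalty


under = segmentation_score([0.9, 0.8, 0.2], max_regions=100)   # few regions, mostly "bad"
balanced = segmentation_score([0.1] * 10, max_regions=100)     # moderate count, all "good"
over = segmentation_score([0.0] * 100, max_regions=100)        # all "good" but fragmented
```

Under these assumptions the balanced segmentation scores highest, the undersegmented one next, and the fully fragmented one lowest.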

Once such a segmentation has been obtained, it is typically
the case that despite good overall quality, the segmentation
data may still contain many obvious (to an observer) errors
which can be corrected at the intermediate level. One such
category of error is the possible absence of significant long lines
which are obvious in the raw data. Thus the next step in the
initialization strategy is to post a line segmentation goal for
the insertion of all long, high contrast lines. In this process,
which is somewhat similar to the process employed by Nazif
([23]), a set of line tokens is extracted from the raw data by
an instance of the line extraction schema and the long, high
contrast lines (Figure 8) are used to split any existing region
tokens that they intersect (Figure 9).


Figure 7: Top-Level Segmentation from Initialization Schema
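
How a line token splits a region is not specified in the text. One simple possibility, shown here purely as an assumption, is to partition the region's pixels by the side of the line on which they fall, using the sign of a 2-D cross product. The function name and the pixel-list representation are hypothetical.

```python
def split_by_line(pixels, p0, p1):
    """Partition pixels by which side of the line p0 -> p1 they fall on.

    Pixels exactly on the line are placed in the second group.
    """
    (x0, y0), (x1, y1) = p0, p1
    left, right = [], []
    for (x, y) in pixels:
        side = (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)  # 2-D cross product
        (left if side > 0 else right).append((x, y))
    return left, right


# A 4x2 block of pixels cut by the vertical line x = 1.5.
region = [(x, y) for x in range(4) for y in range(2)]
left, right = split_by_line(region, (1.5, 0), (1.5, 1))
```

A real implementation would also have to handle lines that only partly cross a region, which this sketch ignores.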

The execution of this stage of the strategy attempts to assure
that all significant discontinuities in the raw image data
are represented by the boundaries of region tokens in the segmentation,
but fails to assure that all diffuse object boundaries
(i.e., those represented by slow spatial changes in feature values)
will be represented. Therefore, long line insertion is followed by
the posting of a goal for the syntactic evaluation of the region
tokens. Through the execution of the syntactic region evaluation
schema, it is expected that when a given region token
spans such a diffuse boundary, the evaluation rules used by the
schema will produce a strong resegmentation hypothesis. Figure 10
indicates, with cross-hatching, those regions for which
the resegmentation hypotheses are strong enough to indicate
a need for resegmentation. The stronger the hypothesis, the
more dense the cross-hatching.


Figure 10: Regions with Strong Resegmentation Hypotheses

Using the goal mechanism of the system to recursively invoke
the initialization schema over the set of region tokens for
which such a hypothesis exists, the segmentation may be refined
in the image areas which contain these diffuse boundaries.




Since  the  image  data  is  progressively  more  localized  as  the  sys¬ 
tem  proceeds  with  this  recursive  invocation,  more  use  may  be 
made  of  local  context  to  focus  on  the  correct  algorithms  and 
features  for  segmentation.  For  example,  by  using  semantic  hy¬ 
potheses  produced  by  an  instance  of  the  semantic  evaluation 
schema, the instances of the initialization schema are able to constrain
the instances of the region segmentation schema which
they  invoke.  Thus,  if  an  initialization  schema  were  invoked 
over  a  region  token  which  spanned  the  boundary  between  bush 
and  tree  (as  indicated  in  Figure  11),  the  constraining  knowledge 
that  both  semantic  objects  might  be  present  in  the  region  could 
help  the  associated  region  segmentation  schema  instance  to 
select  features  and  algorithms  which  could  best  discriminate 
these  two  objects.  Figure  12  demonstrates  the  nature  of  this 
process. 

In this case, the overmerged region on the left of Figure 10
is resegmented using the hypotheses for the existence of both
bush and tree in the region. Based on this knowledge, the image
feature which is selected for the segmentation task is a rotation
of RGB space which maximally distinguishes the mean RGB
of bush objects from that of tree objects (according to data
which has been extracted from a training set and stored in
ILTM). With respect to the original image in Figure 1, it can
be seen that the segmentation in Figure 12 has distinguished
tree from bush and has captured all of the highlight/shadow
boundaries in the tree. However, the segmentation is far from
ideal in that the two different types of bush in the bottom left
have not been distinguished, and also in the fact that many
of the tree highlight/shadow boundaries are quite local and
semantically unimportant. But since the area defined by the
original tree/bush region (Figure 11) is being processed by a
complete recursive instance of the initialization schema, the
remaining tasks of the initialization strategy, including long,
high contrast line insertion, potential recursive invocation of
the initialization schema, and region merging (see below) are
used to complete the processing on this area. The final result
which is returned by this instance of the initialization schema
(Figure 13) identifies all major semantic boundaries in the area
without a great deal of fragmentation. Note that several of
the small regions near the top of the image which appear to
be artifacts of fragmentation were not defined as being part of
the resegmentation region, and thus may not be merged by this
particular instance of the initialization schema.


Figure 11: Tree/Bush Region


Figure 13: Final Result from Resegmentation of Tree/Bush Region


Figure 14: Initialization Schema Segmentation Prior to Final Merge
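
The text does not say how the rotation of RGB space is computed. A simple stand-in, offered only as an assumption, is to take the unit vector from one class mean to the other and use a pixel's coordinate along that vector as the segmentation feature. The mean RGB values below are invented for illustration and are not from the paper's training set.

```python
import math


def discriminant_axis(mean_a, mean_b):
    """Unit vector in RGB space pointing from class mean A to class mean B."""
    diff = [b - a for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(pixel, axis):
    """Scalar feature value: the pixel's coordinate along the axis."""
    return sum(p * w for p, w in zip(pixel, axis))


tree_mean = (40, 90, 30)   # made-up class means, for illustration only
bush_mean = (70, 110, 60)
axis = discriminant_axis(tree_mean, bush_mean)
gap = project(bush_mean, axis) - project(tree_mean, axis)
```

Along this axis the two class means are separated by their full Euclidean distance, which is the most any single direction can provide for the means alone.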

At  the  point  that  all  recursive  invocations  of  the  initial¬ 
ization  schema  have  completed,  and  all  region  tokens  in  the 
ISR  have  acceptably  low  resegmentation  hypotheses,  experi¬ 
ence  has  shown  that  the  updated  segmentation  typically  will 
still  display  some  degree  of  overfragmentation  (Figure  14).  The 
region  merge  schema  is  therefore  invoked  by  the  initialization 
schema  to  merge  all  adjacent  regions  which  exhibit  reasonable 
similarity  under  the  constraints  specified. 

Since  the  constraints  specified  in  the  original  instance  of  the 
initialization  schema  are  passed  down  into  any  goals  posted  by 
that  schema  instance,  this  entire  process  occurs  within  the  orig¬ 
inal  context.  Thus,  even  though  a  complex  chain  of  recursive 
invocations  of  the  initialization  schema  has  occurred,  the  top 
level  constraints  are  observed  at  all  levels  of  processing. 
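
The overall shape of the initialization strategy can be summarized in a toy skeleton. Everything concrete here is an assumption: regions are 1-D lists of feature values, the step functions in `ops` are trivial stand-ins, and a `max_depth` entry is added to bound the recursion. The point is only that the same `constraints` object reaches every step and every recursive call.

```python
from statistics import mean


def initialize(region, constraints, ops, depth=0):
    """One pass of a hypothetical initialization strategy over `region`."""
    regions = ops["segment"](region, constraints)
    regions = ops["insert_lines"](regions, constraints)
    refined = []
    for r in regions:
        strong = ops["evaluate"](r, constraints) > constraints["reseg_threshold"]
        if strong and depth < constraints["max_depth"]:
            # Recursive invocation over a region with a strong resegmentation hypothesis.
            refined.extend(initialize(r, constraints, ops, depth + 1))
        else:
            refined.append(r)
    return ops["merge"](refined, constraints)


def halve(region, constraints):
    """Toy segmentation: cut the region in half."""
    mid = len(region) // 2
    return [region[:mid], region[mid:]] if len(region) > 1 else [region]


def spread(region, constraints):
    """Toy resegmentation strength: range of values inside the region."""
    return max(region) - min(region)


def merge_similar(regions, constraints):
    """Toy merge: join adjacent regions whose means are within a tolerance."""
    merged = []
    for r in regions:
        if merged and abs(mean(merged[-1]) - mean(r)) <= constraints["merge_tol"]:
            merged[-1] = merged[-1] + list(r)
        else:
            merged.append(list(r))
    return merged


ops = {"segment": halve, "insert_lines": lambda rs, c: rs,
       "evaluate": spread, "merge": merge_similar}
constraints = {"reseg_threshold": 2, "max_depth": 3, "merge_tol": 0.5}
final = initialize([1, 1, 1, 9, 9, 9, 9, 9], constraints, ops)
```

On this input the recursion first overfragments the left half and the merge steps then reassemble it, leaving one region per plateau.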

The  final  segmentation  results  produced  by  the  initializa¬ 
tion  schema  (with  no  a  priori  constraints)  on  the  image  from 
Figure  1  are  shown  in  Figures  15  and  16.  This  segmentation 
appears  to  be  qualitatively  more  useful  than  either  of  the  two 
earlier  segmentations  which  were  produced  by  segmentation  al¬ 
gorithms  that  were  applied  globally  and  uniformly  across  the 
image. The types of errors which do remain (e.g., overfragmentation
near the gutter of the house, and overmerging in
the windows) are generally the result of ambiguous image data
and  would  require  high-level  interpretation  processing  to  re¬ 
solve  the  ambiguity  or  direct  the  additional  reorganization  of 
the  intermediate-level  tokens. 


Figure 15: Segmentation from Initialization Schema


Figure 16: Image with Overlaid Segmentation from Initialization Schema


4.2  The  region  schemas 

The intent of the region segmentation schema is to produce
a high quality segmentation over the portion of the image
specified by a region token. If explicit constraints are known,
such as “concentrate on regions of the image which have the
potential for being interpreted as tree”, the schema is able to
use strategies which constrain the segmentation processes such
that the tokens produced should meet the expectations of the
invoking higher level schema. In the case of this particular goal
constraint, the schema would make use of syntactic knowledge
about trees, such as the fact that they are present in textured
areas of the image or that they exhibit certain color characteristics,
to select the low-level segmentation task which would best
discriminate tree from all other objects present in the image.

The region segmentation schema is designed to work according
to a hypothesize-and-test paradigm. The schema hypothesizes
appropriate processing parameters such as image features,
segmentation algorithms, and sensitivity settings, and
then uses these parameters to control the application of specific
low-level image processing tasks through the low-level
process controller. The test portion of the hypothesize-and-test
paradigm is then implemented by the use of the evaluation
constraint associated with the goal. If the test fails, the
schema instance continues to hypothesize new features, algorithms,
and settings until the test succeeds. If no segmentation
is found which meets the evaluation constraint, the segmentation
which came closest to meeting the constraint is returned.
Figures 17 and 18 show two segmentations for the tree/grass
region (Figure 11) which were rejected according to this mechanism
prior to the acceptance of the segmentation which was
shown in Figure 12. With respect to the evaluation constraint
described above, the segmentation in Figure 17 received a low
score due to a need for significant resegmentation, and the low
evaluation score for the segmentation in Figure 18 was due to
overfragmentation. Note, however, that given a different evaluation
constraint, either of these segmentations could have been
chosen over that of Figure 12.


Figure 17: Region Segmentation Which Was Rejected By Region Segmentation Schema


Figure 18: Region Segmentation Which Was Rejected By Region Segmentation Schema
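
The hypothesize-and-test loop, including the fallback to the closest result, might be sketched as follows. The enumeration order, the function names, and the toy `run` and `score` functions are assumptions; GOLDIE proposes hypotheses through its feature and algorithm schemas rather than by exhaustive enumeration.

```python
from itertools import product


def hypothesize_and_test(features, algorithms, sensitivities, run, score, threshold):
    """Return (params, result, accepted) for the first passing hypothesis,
    or for the best-scoring one when none exceeds the threshold."""
    best = None
    for params in product(features, algorithms, sensitivities):
        result = run(*params)
        value = score(result)
        if value > threshold:
            return params, result, True
        if best is None or value > best[0]:
            best = (value, params, result)
    if best is None:
        raise ValueError("no hypotheses to test")
    return best[1], best[2], False


REGIONS_PER_UNIT = {"intensity": 1, "hue": 3}  # invented behavior of the toy segmenter


def run(feature, algorithm, sensitivity):
    """Toy low-level task: returns the number of regions produced."""
    return REGIONS_PER_UNIT[feature] * sensitivity


def score(region_count):
    """Toy evaluation function: prefer segmentations with about six regions."""
    return -abs(region_count - 6)


params, result, accepted = hypothesize_and_test(
    ["intensity", "hue"], ["histogram"], [1, 2, 3], run, score, threshold=-0.5)
fallback = hypothesize_and_test(
    ["intensity", "hue"], ["histogram"], [1, 2, 3], run, score, threshold=1)
```

With the attainable threshold the loop stops at the first passing hypothesis; with the unattainable one it returns the same best candidate but flags it as not accepted.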

Instances of the region segmentation schema make use of
region-feature and region-algorithm schemas to establish hypotheses
for the parameters involved in the specification of low-level
segmentation tasks. Within the region-feature schema,
initial hypotheses for the selection of an appropriate image feature
are made on the basis of the specified constraints. Thus,
if  the  constraints  for  an  instance  of  this  schema  specified  an 
interest  in  tree  and  sky,  the  schema  would  use  the  knowl¬ 
edge  in  the  ILTM  to  hypothesize  a  set  of  features  which  would 
best  discriminate  regions  representing  these  two  objects.  The 
region-feature  schema  instance  would  then  refine  these  initial 
hypotheses  by  evaluating  the  actual  frequency  distribution  of 
each  of  these  features  over  the  area  of  interest.  Based  on  this 
evaluation,  the  hypotheses  would  be  ranked  and  returned  to 
the  invoking  schema  instance  in  order  of  hypothesized  util¬ 
ity.  If  none  of  this  initial  set  of  hypothesized  image  features 
were  acceptable  to  the  invoking  schema,  additional  less  restric¬ 
tive  strategies  would  then  be  employed  by  the  region-feature 
schema  instance  to  propose  additional  segmentation  features. 

If, however, no constraints were specified on the goal, the
schema  instance  initially  hypothesizes  a  set  of  default  features 
which  are  known  to  typically  produce  good  results  across  the 
particular  image  domain.  It  would  be  this  set  of  features  which 
was  evaluated  and  returned  in  order  of  hypothesized  utility. 
Thus  we  see  that,  as  is  the  case  with  most  of  the  schemas  in 
this  system,  the  more  specific  the  information  provided  on  the 
goal  for  the  region  segmentation  schema  instance,  the  more 
specific  the  hypotheses. 
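
The feature-selection behavior described above, with and without constraints, can be sketched as a lookup followed by a ranking. The knowledge table, the default feature list, and the sample values are invented, and population variance is only a stand-in for the paper's unspecified evaluation of each feature's frequency distribution.

```python
from statistics import pvariance

# Hypothetical ILTM-style knowledge: object-label pairs -> candidate features.
FEATURE_KNOWLEDGE = {frozenset({"tree", "sky"}): ["blue", "deviation"]}
# Hypothetical default features used when the goal carries no constraints.
DEFAULT_FEATURES = ["intensity", "hue"]


def rank_features(object_labels, samples):
    """Rank candidate features by the spread of their values over the region.

    `samples` maps feature name -> list of values measured in the area of
    interest. Variance is an assumed proxy for hypothesized utility.
    """
    key = frozenset(object_labels or ())
    candidates = FEATURE_KNOWLEDGE.get(key, DEFAULT_FEATURES)
    return sorted(candidates, key=lambda f: pvariance(samples[f]), reverse=True)


samples = {
    "blue": [10, 12, 11, 13],
    "deviation": [2, 40, 3, 38],
    "intensity": [5, 5, 6, 6],
    "hue": [1, 9, 1, 9],
}
constrained = rank_features(["tree", "sky"], samples)
unconstrained = rank_features(None, samples)
```

The constrained call draws its candidates from the label-specific entry, while the unconstrained call falls back to the defaults, matching the two cases in the text.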

The  GOLDIE  system  currently  utilizes  a  set  of  43  image 
features  such  as  red,  green,  blue,  intensity,  local  deviation  of 
intensity, hue, etc., all of which are computed from the original
RGB. Given the mechanisms by which these features are
managed,  we  believe  that  it  would  be  a  straightforward  exten¬ 
sion  to  include  other  types  of  spatially  distributed  data  such  as 
infrared  or  range  data. 

The  region-algorithm  schema  functions  in  much  the  same 
way  as  the  region-feature  schema,  but  in  this  case  the  schema 
instance  does  not  actually  initiate  any  low-level  processes.  This 
schema  contains  a  set  of  strategies  for  proposing  segmentation 
algorithms  and  sensitivity  settings  which  are  based  on  the  na¬ 
ture  of  the  particular  image  feature  and  the  other  constraints. 
As  currently  implemented,  this  schema  is  able  to  select  between 
algorithms  for  global  one-dimensional  histogram  clustering,  lo¬ 
calized  one-dimensional  histogram  clustering,  zero-crossings, 
thresholding,  and  global  two-dimensional  histogram  clustering, 
each  of  which  can  be  used  with  a  variety  of  sensitivity  settings. 
Given  a  particular  image  feature,  or  set  of  image  features,  an 
instance  of  this  schema  is  able  to  hypothesize  the  algorithm 
and  sensitivity  setting  which  is  most  likely  to  produce  region 
tokens  with  the  desired  characteristics. 

The  schema  first  proposes  the  most  specific  algorithm  that 
it  can  hypothesize  under  the  constraints,  and  if  this  is  not  ac¬ 
cepted,  it  can  then  propose  progressively  more  general  algo¬ 
rithms.  For  example,  if  an  instance  of  this  schema  were  invoked 
with  the  constraints  that  the  region-feature  was  Deviation  (a 
textural  feature  which  tends  to  smooth  over  object  boundaries) 
and  the  object  label  of  interest  was  tree,  the  schema  instance 
would  first  propose  the  thresholding  algorithm  with  low  sensi¬ 
tivity.  This  algorithm  would  be  expected  to  distinguish  high 
texture  areas  (tree)  from  low-texture  regions  without  introduc¬ 
ing  excessive  fragmentation  in  either  area.  Additionally,  the 
thresholding  algorithm  would  be  expected  to  place  the  bound¬ 
ary  between  region  tokens  at  the  midpoint  of  the  image  feature 
gradient  of  the  boundary,  hopefully  restoring  the  blurred  ob¬ 
ject  boundary.  If  this  algorithm  was  found  to  be  unacceptable, 
the schema instance would then propose a low sensitivity zero-crossing
algorithm. If this were also rejected, a final hypothesis
of  low-sensitivity  one-dimensional  histogram  clustering  would 
be  proposed. 
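The propose-then-generalize behavior in this example can be sketched as a simple loop. The proposal order comes from the Deviation/tree example above; the function name, the tuple representation, and the acceptance callback are assumptions made for illustration.

```python
def propose_until_accepted(proposals, accept):
    """Return the first proposal the invoking schema accepts, or None."""
    for proposal in proposals:
        if accept(proposal):
            return proposal
    return None

# Ordered from most specific to most general, as in the Deviation/tree example.
DEVIATION_TREE_PROPOSALS = [
    ("thresholding", "low"),
    ("zero-crossing", "low"),
    ("1-D histogram clustering", "low"),
]

# Hypothetical invoking schema that rejects the thresholding result.
chosen = propose_until_accepted(
    DEVIATION_TREE_PROPOSALS,
    accept=lambda proposal: proposal[0] != "thresholding",
)
```

In this run `chosen` is the low-sensitivity zero-crossing proposal; if every proposal were rejected the function would return `None`.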

Once  the  region  segmentation  schema  has  produced  an  ac¬ 
ceptable  set  of  region  tokens,  other  region  schemas  may  be  used 
to  evaluate  or  modify  these  tokens.  The  syntactic  region 
evaluation  schema  is  concerned  with  the  creation  of  ISTM 
hypotheses  about  region  tokens  which  indicate  a  belief  in  the 
“goodness”  of  a  particular  region  structure  (i.e.  the  degree  to 
which  the  existence  of  the  region  token  is  supported  by  the  un¬ 
derlying  image  feature  data).  Based  on  assumptions  expressed 
in  goal  constraints,  such  as  “we  are  interested  in  regions  which 
exhibit  slow  intensity  gradients”,  the  schema  selects  various 
subsets of the evaluation functions which are stored in ILTM in
order  to  determine  whether  or  not  the  region  token  should  be 
merged  or  resegmented.  The  results  of  these  evaluation  func¬ 
tions  are  combined,  and  stored  in  ISTM  as  hypotheses  about 
the  region  tokens.  The  hypothesis  values  may  then  be  accessed 
by  other  schemas  of  the  system  which  will  take  the  appropriate 
actions. 

The  schema  for  the  semantic  region  evaluation  utilizes 
sets  of  rules  which  evaluate  the  characteristics  of  regions,  such 
as  short  line  density,  color,  etc.  to  assign  a  plausible  semantic 
label to a region (e.g. sky or tree). These rules, which are
stored  in  the  ILTM,  are  not  as  precise  or  complete  as  those 
used  by  a  complete  interpretation  system  ([2,30]),  but  provide 
valuable  information  for  the  intermediate-level  schemas  in  the 
case  that  no  overriding  hypotheses  have  been  specified  as  goal 
constraints. 

The  region  merge  schema  is  used  by  the  system  to  re¬ 
duce  potential  overfragmentation.  By  using  goal  constraints 
to select a subset of the evaluation functions stored in ILTM,
instances  of  this  schema  are  able  to  evaluate  the  desirability  of 
a  merge  of  a  set  of  adjacent  regions,  and,  if  necessary,  direct 
the  actual  merge  process.  The  schema  may  either  be  applied 
in  a  general  fashion  over  the  entire  image,  evaluating  all  region 
tokens  in  the  ISR  for  merge  potential,  or  it  may  be  invoked 
to  directly  merge  a  set  of  adjacent  region  tokens  which  some 
high-level  process  has  deemed  to  be  similar.  Again  using  our 
example  of  a  system  with  the  a  priori  constraint  to  discriminate 
tree  and  sky,  an  instance  of  this  schema  would  select  two  sets 
of merge criteria. In textured areas of the image, regions would
be  merged  if  they  were  adjacent,  at  least  one  of  the  tokens  had 
a strong hypothesis for merge potential, and they both exhibited
similar textural and hue characteristics. In non-textured
areas  of  the  image,  the  merge  criteria  would  still  make  use  of 
adjacency  and  merge-hypothesis,  but  the  regions  would  also 
have  to  demonstrate  similar  low  values  of  intensity  deviation 
as  well  as  demonstrating  co-planar  intensity  surface  fits. 
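The two sets of merge criteria in the tree/sky example can be written as a single predicate. This is a sketch under stated assumptions: the `Region` fields, the thresholds, and the decision not to merge a textured region with a non-textured one are hypothetical and are not taken from the GOLDIE implementation.

```python
from dataclasses import dataclass

@dataclass
class Region:
    textured: bool
    merge_belief: float         # strength of the merge hypothesis, 0..1 (assumed scale)
    texture: float
    hue: float
    intensity_deviation: float
    plane: tuple                # (a, b, c) of a fitted intensity plane a*x + b*y + c

def close(a, b, tol):
    return abs(a - b) <= tol

def should_merge(r1, r2, adjacent, belief_min=0.7, tol=0.1, low_dev=0.2):
    # Both criteria sets require adjacency and at least one strong merge hypothesis.
    if not adjacent or max(r1.merge_belief, r2.merge_belief) < belief_min:
        return False
    if r1.textured and r2.textured:
        # Textured areas: similar texture and hue.
        return close(r1.texture, r2.texture, tol) and close(r1.hue, r2.hue, tol)
    if not r1.textured and not r2.textured:
        # Non-textured areas: low intensity deviation and co-planar surface fits.
        low = r1.intensity_deviation <= low_dev and r2.intensity_deviation <= low_dev
        coplanar = all(close(p, q, tol) for p, q in zip(r1.plane, r2.plane))
        return low and coplanar
    return False                # mixed pair: not merged (assumption)

tree_a = Region(True, 0.9, 0.80, 0.30, 0.6, (0.0, 0.0, 0.4))
tree_b = Region(True, 0.2, 0.75, 0.33, 0.5, (0.0, 0.0, 0.5))
sky = Region(False, 0.9, 0.05, 0.60, 0.05, (0.0, 0.01, 0.9))
```

With these sample regions, `should_merge(tree_a, tree_b, adjacent=True)` holds, while the tree/sky pair is rejected.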

4.3  The  line  schemas 

The  line  segmentation  schema  is  designed  to  either  di¬ 
rectly  insert  lines  into  a  region  representation  and  thereby  re¬ 
define  the  region  mapping,  or  to  control  a  region  segmentation 
process  which  is  designed  to  produce  region  boundaries  which 
show  a  high  degree  of  overlap  with  a  set  of  specified  lines.  The 


former  option  is  utilized  in  the  case  that  there  is  strong  belief 
in  the  existence  of  the  lines,  and  when  the  specifications  of  the 
lines  are  known  with  a  high  degree  of  accuracy.  Thus,  given  a 
set  of  lines  which  had  been  produced  through  the  line  extrac¬ 
tion  schema,  and  were  therefore  known  to  be  present  in  the  raw 
image  data,  the  lines  could  be  directly  inserted  into  the  region 
representation.  A  different  rationale  for  this  process  would  be 
a  situation  in  which  the  position  of  a  mobile  robot  was  known 
with  respect  to  a  road,  but  the  boundaries  of  the  road  were  not 
immediately apparent in the robot’s image data. In this case,
the  road  boundaries  could  be  determined  from  an  internal  map 
and  then  be  placed  directly  into  the  segmentation  data  to  aid 
in  the  interpretation  of  the  remainder  of  the  image. 

This  type  of  behavior  by  the  schema  was  demonstrated  in 
Figure  9,  where  the  insertion  of  long  lines  restored  several  im¬ 
portant  boundaries  which  had  been  missed  in  the  original  re¬ 
gion  segmentation  from  Figure  7.  Since  the  lines  are  used  to 
split  all  regions  with  which  they  have  significant  intersection, 
several artificial boundaries were also introduced by this
process.  However,  as  was  shown  in  the  final  result  (Figure  15), 
the  region  merging  process  was  able  to  remove  most  boundaries 
which  did  not  have  a  basis  in  the  underlying  data. 

If  a  high-level  interpretation  schema  concerned  with  shape 
were  to  make  a  hypothesis  that  a  line  might  possibly  exist  in 
an  area  of  the  image,  this  type  of  direct  insertion  would  not 
be  desirable  (i.e.  the  hypothesis  might  be  incorrect).  In  this 
case,  the  characteristics  of  image  feature  distributions  across 
the  hypothesized  line  would  be  examined  in  an  attempt  to  find 
a  feature  which  could  be  used  by  a  region  segmentation  process 
to  produce  a  region  mapping  in  which  the  line  was  matched  by 
region  boundaries. 

The  line  tokens  which  are  utilized  by  these  schemas  are  ei¬ 
ther  created  as  hypothesized  lines  by  an  interpretation  schema 
or are extracted from the data by an instance of the line extraction
schema, using one of several different straight-line extraction
algorithms ([4,35]). The schemas for line algorithm
and line feature produce the necessary hypotheses for an instance
of the line extraction schema, and these two schemas
function  in  a  fashion  similar  to  their  analogues  for  region  seg¬ 
mentation.  Figure  19  shows  the  set  of  lines  produced  by  the 
line  extraction  schema,  from  which  the  set  of  long  high  contrast 
lines  in  Figure  8  were  selected. 

The  low  level  extraction  processes  controlled  by  instances 
of  this  schema  may  be  constrained  through  goal  constraints  to 
restrict  the  output  according  to  location,  line  orientation,  line 
length,  or  contrast  across  the  line.  Thus  a  high  level  interpre¬ 
tation  process  dealing  with  object  shape  which  needed  to  find  a 
particular  line  to  complete  a  rectangle  could  post  a  goal  to  find 
a  straight  line  in  the  image  data  which  is  similar  to  the  desired 
line.  On  the  other  hand,  a  schema  process  which  was  interested 
in  the  short  line  density  over  an  area  of  the  image  would  invoke 
this  schema  with  the  constraint  to  find  all  possible  lines.  The 
set  of  line  tokens  would  then  be  filtered  on  length,  and  perhaps 
contrast,  to  compute  the  density  measure. 
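The filtering step in the short-line-density example might look like the sketch below. The dictionary layout of a line token, the thresholds, and the definition of density as a count per unit area are assumptions; the text does not specify them.

```python
import math

def length(line):
    """Euclidean length of a line token with endpoints p0 and p1."""
    (x0, y0), (x1, y1) = line["p0"], line["p1"]
    return math.hypot(x1 - x0, y1 - y0)

def short_line_density(lines, area, max_length=10.0, min_contrast=0.0):
    """Count the short, sufficiently contrasted lines per unit area."""
    short = [
        line for line in lines
        if length(line) <= max_length and line["contrast"] >= min_contrast
    ]
    return len(short) / area

# Hypothetical line tokens returned by an unconstrained extraction.
lines = [
    {"p0": (0, 0), "p1": (3, 4), "contrast": 12.0},    # length 5, kept
    {"p0": (0, 0), "p1": (30, 40), "contrast": 20.0},  # length 50, too long
    {"p0": (5, 5), "p1": (5, 9), "contrast": 2.0},     # length 4, contrast too low
]
density = short_line_density(lines, area=100.0, max_length=10.0, min_contrast=5.0)
```

Only the first token survives both filters, so the density here is one line per 100 units of area.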

The collinear line merge schema is utilized in the case
that a desired line has not been found intact by the line extraction
processes. The low-level process controlled by this schema
examines the set of existing line tokens which have been produced
by the line extraction process to determine if suitable
line tokens may be merged to produce the desired line ([35]).
Figure 20 shows the long lines created by the application of the
collinear line merge schema to the lines of Figure 8.

Figure 19: Lines Extracted by Line Extraction Schema
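As a rough illustration of what merging collinear line tokens involves, the sketch below joins two segments when they are nearly parallel, lie close to a common line, and are separated by a small gap. It is a simplified stand-in, not the grouping algorithm of [35]; the thresholds and function names are assumptions.

```python
import math

def _axis(seg):
    """Unit direction and length of a segment ((x0, y0), (x1, y1))."""
    (x0, y0), (x1, y1) = seg
    n = math.hypot(x1 - x0, y1 - y0)
    return ((x1 - x0) / n, (y1 - y0) / n), n

def _along(p, origin, u):
    """Signed position of point p along direction u, measured from origin."""
    return (p[0] - origin[0]) * u[0] + (p[1] - origin[1]) * u[1]

def mergeable(s1, s2, max_sin=0.05, max_offset=1.5, max_gap=5.0):
    """Hypothetical test: nearly parallel, nearly on one line, small end gap."""
    u, n1 = _axis(s1)
    v, _ = _axis(s2)
    if abs(u[0] * v[1] - u[1] * v[0]) > max_sin:    # |sin| of the angle between them
        return False
    a = s1[0]
    for p in s2:                                     # distance from the line through s1
        if abs((p[0] - a[0]) * u[1] - (p[1] - a[1]) * u[0]) > max_offset:
            return False
    t = sorted(_along(p, a, u) for p in s2)
    gap = max(0.0, max(0.0, t[0]) - min(n1, t[1]))   # zero when the spans overlap
    return gap <= max_gap

def merge(s1, s2):
    """Replace two segments by the span of their endpoints along s1's direction."""
    u, _ = _axis(s1)
    a = s1[0]
    ts = [_along(p, a, u) for p in (*s1, *s2)]
    lo, hi = min(ts), max(ts)
    return ((a[0] + lo * u[0], a[1] + lo * u[1]),
            (a[0] + hi * u[0], a[1] + hi * u[1]))

s1 = ((0.0, 0.0), (10.0, 0.0))
s2 = ((12.0, 0.5), (20.0, 0.4))
merged = merge(s1, s2) if mergeable(s1, s2) else None
```

The merged segment keeps the orientation of the first token and ignores the small lateral offset of the second, which is acceptable only because `mergeable` has already bounded that offset.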

5.  CONCLUSION 

The  set  of  schemas  we  have  described  is  currently  imple¬ 
mented  within  the  GOLDIE  system  at  the  University  of  Mas¬ 
sachusetts,  and  research  is  proceeding  on  the  interface  between 
this  system  and  the  high  level  interpretation  schemas  being 
developed  by  other  researchers  in  the  VISIONS  group.  As 
currently  implemented,  the  GOLDIE  system  provides  an  in¬ 
complete,  yet  powerful  mechanism  for  the  control  of  low-level 
image  processing.  Future  development  of  the  system  to  add 
schemas  for  additional  low-level  processing  tasks,  and  to  im¬ 
prove  the  computational  efficiency,  will  produce  a  system  which 
dramatically  affects  the  overall  image  interpretation  process. 

There  are  four  major  aspects  of  GOLDIE  which  contribute 
to  the  utility  of  the  system.  Foremost  is  the  fact  that  the  sys¬ 
tem  is  goal-driven.  This  paradigm  provides  a  coherent  mech¬ 
anism  for  top  down  control  in  which  low-level  processing  may 
be  tuned  to  meet  the  expectations  of  higher  level  processes 
which  are  expressed  through  goal  constraints.  Second,  the  sys¬ 
tem  makes  use  of  a  unified  data  representation  which  allows 
processes  at  any  level  of  the  interpretation  hierarchy  to  create, 
access,  or  modify  the  data  which  represents  the  current  state  of 
interpretation  processing.  The  third  aspect  is  that  the  system 
contains  an  explicit  representation  of  knowledge  about  low- 
level  processes  and  the  image  domain  which  provides  a  flex¬ 
ibility  permitting  extension  to  additional  processes  or  image 
domains.  Finally,  the  GOLDIE  system  is  capable  of  operating 
within  a  hypothesize-and-test  protocol,  exploring  a  variety  of 
potential  solutions  to  high-level  requests  rather  than  operating 
in  a  strictly  deterministic  manner. 


REFERENCES 

[1] Dana H. Ballard and Christopher M. Brown. Computer Vision.
Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, 1982.

[2] R. Belknap, E. Riseman, and A. Hanson. The information fusion
problem and rule-based hypotheses applied to complex aggregations
of image events. In Proceedings of IEEE CVPR Conference,
pages 227-234, June 1986.

[3] Rodney A. Brooks and Thomas O. Binford. Interpretive vision
and restriction graphs. In Proceedings of The First Annual National
Conference on Artificial Intelligence, pages 21-27, August
1980.

[4] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting
straight lines. In Proceedings Seventh International Conference
on Pattern Recognition, pages 482-483, Montreal, July 1984.

[5] John F. Canny. Finding Edges and Lines in Images. Technical
Report 720, Artificial Intelligence Laboratory, Massachusetts Institute
of Technology, Cambridge, Massachusetts, June 1983.

[6] B. Draper, R. Collins, J. Brolio, J. Griffith, A. Hanson, and
E. Riseman. Tools and experiments in knowledge based interpretation
of road scenes. In Proceedings: Image Understanding
Workshop, 1987. To be published.

[7] B. Draper, A. Hanson, and E. Riseman. A Software Environment
for High Level Vision. Technical Report, Computer and
Information Science Department, University of Massachusetts
at Amherst, 1980. In preparation.

[8] Richard O. Duda and Peter E. Hart. Pattern Classification and
Scene Analysis. John Wiley and Sons, New York, NY, 1973.

[9] Joey Griffith, Allen Hanson, and Edward Riseman. A rule based
region merging system. In preparation.

[10] Allen R. Hanson and Edward M. Riseman, editors. Computer
Vision Systems. Academic Press Inc., New York, NY, 1978.




[11] Allen R. Hanson and Edward M. Riseman. Visions image understanding
system - 1986. In Christopher M. Brown, editor,
Advances in Computer Vision, Erlbaum Assoc., 1987. To be
published.

[12] Allen R. Hanson, Edward M. Riseman, and Paul R. Nagin.
Region-growing in Textured Outdoor Scenes. Technical Report
75C-3, Computer and Information Science, University of
Massachusetts at Amherst, Amherst, Massachusetts 01003, 1982.

[13] Robert M. Haralick. Digital step edges from zero crossing of
second directional derivatives. IEEE Trans. on Pattern Analysis
and Machine Intelligence, PAMI-6(1):58-68, January 1984.

[14] Robert M. Haralick and Linda G. Shapiro. Image segmentation
techniques. Computer Vision, Graphics, and Image Processing,
29(1):100-132, January 1985.

[15] Vincent Shang-Shouq Hwang, Larry S. Davis, and Takashi Matsuyama.
Hypothesis Integration in Image Understanding Systems.
Technical Report TR-1137, Center for Automation Research,
University of Maryland, College Park, Maryland 20742,
June 1985.

[16] Charles A. Kohl. Goal Directed Image Segmentation. PhD
thesis, Computer and Information Science Department, University
of Massachusetts, Amherst, Massachusetts 01003, 1987. In
preparation.

[17] R. R. Kohler and A. R. Hanson. The visions image operating
system. In Proceedings Sixth International Conference on Pattern
Recognition, pages 71-74, Munich, October 1982.

[18] Ralf Kohler. A segmentation system based on thresholding.
Computer Graphics and Image Processing, (15):319-338, 1981.

[19] Martin D. Levine and Samir I. Shaheen. A modular computer
vision system for picture segmentation and interpretation.
Pattern Analysis and Machine Intelligence, PAMI-3(5):540-554,
September 1981.

[20] David Marr. Vision. W. H. Freeman, San Francisco, CA, 1982.

[21] M. Nagao and T. Matsuyama. A Structural Analysis of Complex
Aerial Photographs. Plenum, New York, NY, 1980.

[22] Paul A. Nagin. Studies in Image Segmentation Algorithms
Based on Histogram Clustering and Relaxation. Technical Report
79-15, Computer and Information Science, University of
Massachusetts at Amherst, Amherst, Massachusetts 01003, 1979.

[23] Ahmed M. Nazif. A Rule-Based Expert System for Image Segmentation.
PhD thesis, Electrical Engineering Department,
McGill University, Montreal, Canada, March 1983.

[24] Ronald B. Ohlander. Analysis of Natural Scenes. PhD thesis,
Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, PA 15213, April 1975.

[25] Yu-ichi Ohta. Knowledge-based Interpretation of Outdoor Natural
Color Scenes. Pitman Advanced Publishing Program,
Boston, MA, 1985.

[26] Yu-ichi Ohta. A Region-Oriented Image-Analysis System by
Computer. PhD thesis, Kyoto University Department of Computer
Science, 1980.


[27] Cesare C. Parma, Allen R. Hanson, and Edward M. Riseman.
Experiments in Schema-Driven Interpretation of A Natural
Scene. Technical Report 80-10, Computer and Information
Science Department - University of Massachusetts at Amherst,
Amherst, Massachusetts 01003, April 198.

[28] George Reynolds and J. Ross Beveridge. Searching for geometric
structure in images of natural scenes. In Proceedings: Image
Understanding Workshop, 1987. To be published.

[29] George Reynolds, Nancy Irwin, Allen Hanson, and Edward Riseman.
Hierarchical knowledge-directed object extraction using a
combined region and line representation. In IEEE 1984 Proceedings
Of The Workshop On Computer Vision Representation
And Control, pages 238-247, IEEE, 1984.

[30] Edward M. Riseman and Allen R. Hanson. A Methodology for
the Development of General Knowledge-Based Vision Systems.
Technical Report 86-27, Computer and Information Science, University
of Massachusetts at Amherst, Amherst, Massachusetts
01003, 1986.

[31] A. Rosenfeld and A. Kak. Digital Picture Processing. Academic
Press, New York, NY, 1976.

[32] Bert Shaw. Some Remarks on the Use of Color for Machine
Vision. Technical Report 83-31, Computer and Information
Science Department - University of Massachusetts at Amherst,
Amherst, Massachusetts 01003, September 1983.

[33] J. M. Tenenbaum and H. G. Barrow. Experiments in
Interpretation-Guided Segmentation. Technical Report 123, SRI
International, 333 Ravenswood Ave., Menlo Park, CA 94025,
March 1976.

[34] William B. Thompson and Albert Yonas. What should be computed
in low level vision. In Proceedings of The First Annual
National Conference on Artificial Intelligence, pages 7-10, August
1980.

[35] Richard Weiss and Michael Boldt. Geometric grouping applied
to straight lines. In Proceedings of IEEE-CVPR Conference,
pages 489-495, June 1986.

[36] Terry E. Weymouth. Using Object Descriptions in a Schema
Network for Machine Vision. PhD thesis, University of Massachusetts,
Amherst, Massachusetts 01003, 1986.

[37] Y. Yakimovsky. Scene Analysis Using a Semantic Base for Region
Growing. PhD thesis, Stanford University, Palo Alto, CA,
1973.




ACTIVE  VISION 

John  (Yiannis)  Aloimonos 
Isaac  Weiss 

Center  for  Automation  Research 
University  of  Maryland 
College Park, MD 20742

Amit Bandyopadhyay
Department  of  Computer  Science 
SUNY  Stony  Brook 
Stony  Brook,  NY  11790 


ABSTRACT 

We  investigate  several  basic  problems  in  vision  under 
the  assumption  that  the  observer  is  active.  An  observer 
is  called  active  when  engaged  in  some  kind  of  activity 
whose  purpose  is  to  control  the  geometric  parameters  of 
the  sensory  apparatus.  The  purpose  of  the  activity  is  to 
manipulate  the  constraints  underlying  the  observed 
phenomena in order to improve the quality of the perceptual
results. For example, a monocular observer that
moves with a known or unknown motion and a binocular
observer that can rotate his eyes and track environmental
objects are two examples of an observer that we call
active. We prove that an active observer can solve basic
vision  problems  in  a  much  more  efficient  way  than  a  pas¬ 
sive  one.  Problems  that  are  ill-posed,  nonlinear  or 
unstable  for  a  passive  observer  become  well-posed,  linear 
or  stable  for  an  active  observer.  In  particular,  the  prob¬ 
lems  of  shape  from  shading  and  depth  computation, 
shape  from  contour,  shape  from  texture  and  structure 
from  motion  are  shown  to  be  much  easier  for  an  active 
observer  than  for  a  passive  one.  It  has  to  be  emphasized 
that  correspondence  is  not  used  in  our  approach,  i.e., 
active  vision  is  not  correspondence  of  features  from  multi¬ 
ple  viewpoints.  Finally,  active  vision  here  does  not  mean 
active  sensing. 

1.  INTRODUCTION  AND  MOTIVATION 

Most  past  and  present  research  in  machine  percep¬ 
tion  has  involved  analysis  of  passively  sampled  data 
(images). Human perception, however, is not passive; it
is  active.  Perceptual  activity  is  exploratory  and  search¬ 
ing.  When  humans  see  and  understand,  they  actively 
look.  In  the  process  of  looking,  their  eyes  adjust  to  the 
level  of  illumination,  focus  on  certain  things,  converge  or 
diverge  and  their  heads  move  to  obtain  a  better  view  of 
the  scene. 

It  is  very  natural  to  ask  why  human  observers 
operate  in  such  a  way,  because  certainly  humans  are  very 
efficient in visual tasks. In other words, how does the
fact that an observer is active affect the levels of a visual
system (as described by Marr, 1982), namely computational
theory, representations and processing algorithms,
and implementation? Does an active observer have any


advantage  over  a  passive  observer,  in  any  computational 
theoretic,  algorithmic  or  implementational  way?  In  this 
research,  we  examine  this  question  and  we  find  that 
indeed  an  active  observer  has  a  great  deal  of  advantage 
as  far  as  the  first  two  levels  of  the  visual  system  are  con¬ 
cerned  (computational  theory,  algorithms).  A  natural 
way  to  examine  this  question  is  to  study  basic  problems 
of  vision  whose  solutions  demonstrate  visual  abilities, 
avoiding  in  this  way  the  potential  philosophical  snare  of 
getting  into  a  discussion  of  the  vision  problem  in  general. 

Another  motivation  for  examining  active  vision  is 
the fact that passive vision has been shown to be very
problematic.  Almost  every  basic  problem  in  passive 
machine  perception  is  very  difficult,  because  it  is  ill-posed 
in  the  sense  of  Hadamard  (1923).  So,  because  there  does 
not  exist  a  unique  solution,  the  problem  has  to  be  regu¬ 
larized,  by  imposing  additional  constraints,  which  should 
be  physically  plausible.  An  example  of  such  an  additional 
constraint  is  some  kind  of  smoothness  of  the  unknown 
functions.  There  has  been  excellent  research  in  regulariz¬ 
ing  early  vision  problems,  originated  in  (Poggio  et  al., 
1986;  Poggio  and  Koch,  1985).  Even  though  the  regulari¬ 
zation  paradigm  is  very  attractive  for  its  mathematical 
elegance  and  for  being  a  legitimization  of  already  pub¬ 
lished  research  (Horn  and  Schunck,  1981;  Ikeuchi  and 
Horn,  1981;  Hildreth,  1984),  it  has  some  shortcomings,  in 
the  sense  that  it  cannot  deal  with  the  full  complexity  of 
vision.  One  problem  is  the  degree  of  smoothness  required 
for  the  unknown  function  that  has  to  be  recovered;  for 
example,  some  unrealistic  results  have  been  reported  in 
surface  interpolation,  because  depth  discontinuities  are 
smoothed too much. Research on regularization in the
presence  of  discontinuities,  while  pioneering,  is  still 
premature  (Terzopoulos,  1984;  Lee  and  Pavlidis,  1986). 
Another  problem  is  that  standard  regularization  theory 
deals  with  linear  problems  and  is  based  on  quadratic  sta¬ 
bilizers.  In  the  case  of  nonquadratic  functionals  standard 
regularization  theory  may  be  used,  but  the  situation  is 
problematic  (Morozov,  1984).  For  non-quadratic  func¬ 
tionals,  the  search  space  may  have  many  local  minima 
and  in  this  case  only  stochastic  algorithms  might  have 
some  success  (Kirkpatrick  et  al.,  1984).  Aside  from  the 
fact  that  most  passive  vision  problems  are  ill-posed,  some 
well  posed  problems  are  very  unstable.  That  is  to  say, 
even if from the physical constraints the problem is
shown to have a unique solution, finding this solution is
very  difficult  and  unstable,  in  the  sense  that  a  small  error 
in  the  input  of  the  perceptual  process  can  create  catas¬ 
trophic  results  in  the  output.  An  example  of  such  a  prob¬ 
lem  is  the  passive  navigation  problem  (Ullman,  1977;  Tsai 
and  Huang,  1984;  Bruss  and  Horn,  1984;  Waxman  and 
Ullman,  1985;  Longuet-Higgins,  1981;  Longuet-Higgins 
and Prazdny, 1982), where retinal motion (retinal velocities
or displacements) is used as the input. In contrast,
it  will  be  shown  that  some  problems  that  are  ill-posed  for 
a  passive  observer  become  well-posed  for  an  active  one, 
and  problems  that  are  unstable  become  stable. 
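For readers unfamiliar with the regularization paradigm discussed above, standard regularization with a quadratic stabilizer is usually written in the following form. The symbols are not defined in this paper and are introduced here only for illustration: $y$ is the observed data, $A$ a linear operator relating the unknown function $z$ to the data, $P$ a linear operator defining the stabilizer, and $\lambda$ the regularization parameter.

```latex
\min_{z} \; \|Az - y\|^{2} + \lambda \, \|Pz\|^{2}, \qquad \lambda > 0
```

The first term measures agreement with the data and the second penalizes non-smooth solutions. When $A$ and $P$ are linear, the functional is quadratic in $z$, which is the case covered by the standard theory referred to in the text.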

We  now  set  out  to  examine  the  advantages  of  an 
active  observer  with  respect  to  the  computational  theory 
and  algorithms  levels  of  visual  systems.  In  particular  we 
show  that: 

(1)  The  problem  of  shape  from  shading  in  the  case  of  a 
passive  observer  has  infinitely  many  solutions  and  addi¬ 
tional  assumptions  are  required  to  guarantee  uniqueness. 
Furthermore,  the  stability  of  the  developed  algorithms  is 
in  question.  In  contrast,  in  the  case  of  an  active  observer, 
the  shape  from  shading  problem  is  shown  to  have  a 
unique  solution.  In  particular  the  commonly  used 
assumptions  of  smoothness  of  the  visible  surface  and  con¬ 
stant  albedo  become  unnecessary.  This  makes  it  possible 
to  deal  with  complex  shapes  having  discontinuities  of  sur¬ 
face  orientation  and  reflectance.  The  isotropic  radiance 
constraint can be relaxed to deal with partly specular surfaces.
Our method is not prone to instabilities.
Depth computation is also addressed.

(2)  The  problem  of  shape  from  contour  is  a  difficult  one 
for  a  passive  observer  since  assumptions  have  to  be 
employed  to  obtain  a  unique  solution.  We  show  that  in 
the  case  of  an  active  observer  the  problem  has  a  unique 
solution,  which,  moreover,  can  be  found  using  linear 
equations.  The  stability  of  the  proposed  computation  is 
also  examined. 

(3)  The  problem  of  shape  from  texture  requires  assump¬ 
tions  about  the  texture  to  make  it  solvable  by  a  passive 
observer.  We  prove  that  an  active  observer  can  recover 
shape  from  texture  without  any  assumptions  and  using 
linear  equations. 

(4)  Finally,  the  problem  of  structure  from  motion  has 
been  shown  to  be  very  unstable.  We  show  that  an  active 
observer  can  recover  structure  from  motion  by  using 
linear  equations. 

Table 1 compares the performance of a
passive  and  an  active  observer  in  the  solution  of  several 
basic  problems. 

In  the  following  sections  we  study  the  basic  problems 
described  above.  The  case  of  the  computation  of  optic 
flow  will  be  described  in  subsequent  publications.  In  all 
cases,  we  try  to  avoid  solving  the  correspondence  prob¬ 
lem.  Also,  at  this  point  it  should  be  clear  that  active 
vision  is  not  active  sensing;  it  is  just  vision  in  which  phy¬ 
sical  constraints  are  simplified  because  the  observer  can 
change  state  in  an  active  way. 


2. SHAPE FROM SHADING

2.1. Introduction

In  the  problem  of  recovering  shape  from  shading,  the 
input  consists  of  the  brightness  at  each  point  of  an  image, 
or  images,  and  the  desired  output  is  the  depth  and/or  the 
surface  normal  of  the  corresponding  point  on  the  visible 
surface.  In  principle,  the  depth  map  (the  depth  z  as  a 
function  of  the  x,y  coordinates)  contains  all  the  informa¬ 
tion  about  the  surface  and  the  surface  normals  can  be 
computed  directly  from  the  knowledge  of  the  depth  map. 
In  practice,  however,  those  depths  cannot  be  derived  with 
sufficient  accuracy  for  calculating  the  normals  and  one 
would  like  to  infer  the  normals  directly  from  the  image. 
In  the  following  we  briefly  discuss  the  current  approaches 
to  solving  the  problem  and  their  severe  limitations,  and 
then  show  how  the  active  vision  paradigm  overcomes 
these  limitations. 
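To make the relation between a depth map and surface normals concrete, the following sketch estimates normals from a sampled depth map by central differences. The grid layout, the unit sample spacing, and the convention that normals point toward the viewer along +z are assumptions made for this illustration.

```python
import math

def normals_from_depth(z, spacing=1.0):
    """Unit normals at interior points of a depth map z[y][x].

    The surface z = z(x, y) has the unnormalized normal (-z_x, -z_y, 1).
    """
    normals = {}
    for y in range(1, len(z) - 1):
        for x in range(1, len(z[0]) - 1):
            zx = (z[y][x + 1] - z[y][x - 1]) / (2 * spacing)
            zy = (z[y + 1][x] - z[y - 1][x]) / (2 * spacing)
            n = math.sqrt(zx * zx + zy * zy + 1.0)
            normals[(x, y)] = (-zx / n, -zy / n, 1.0 / n)
    return normals

# A planar patch z = 0.5 x + 0.25 y sampled on a 4 x 4 grid.
depth = [[0.5 * x + 0.25 * y for x in range(4)] for y in range(4)]
normals = normals_from_depth(depth)
```

Because the slopes are differences of neighboring depths, an error of size d in a single depth sample changes the estimated slope by d / (2 * spacing), which suggests why small depth errors can produce large errors in the normals.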

The  simplest  approach  to  the  shape  from  shading 
problem  involves  using  one  image  of  the  surface.  To  find 
a  solution,  the  following  assumptions  are  commonly  used, 
none  of  which  is  particularly  valid  in  a  realistic  situation: 

1) The surface is smooth.

2)  The  surface  reflectance  characteristics,  usually 
Lambertian,  are  the  same  throughout  the  surface. 

3)  The  lighting,  usually  one  point  light  source,  is  the 
same  throughout  the  surface. 

4)  The  image  is  nearly  noise-free. 

Based  on  these  assumptions,  one  can  write  a  func¬ 
tional  of  the  surface  and  its  normals,  which  should  be 
minimal  for  the  correct  surface.  The  functional  is  in  gen¬ 
eral  a  non-linear  function  of  a  large  number  of  unknowns 
(depth  and  normals  at  each  point),  so  it  is  very  hard  to 
achieve  convergence  of  the  numerical  optimization  to  the 
global  minimum.  Very  good  research  along  this  line  is 
described  in  Horn  (1977)  and  Ikeuchi  and  Horn  (1981). 
But  in  this  case,  noise  compounds  the  problem,  creating 
instabilities.  So  these  techniques  have  had  only  very  lim¬ 
ited  success. 
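A small numerical example shows why a single brightness value cannot determine surface orientation, even under the idealized assumptions listed above. The sketch uses the Lambertian model, in which brightness is proportional to the cosine of the angle between the surface normal and the light direction; the function names and the sample vectors are assumptions.

```python
import math

def unit(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def lambertian_brightness(normal, light, albedo=1.0):
    """Brightness of a Lambertian patch lit by one distant point source."""
    n, s = unit(normal), unit(light)
    return albedo * max(0.0, sum(a * b for a, b in zip(n, s)))

light = (0.0, 0.0, 1.0)
tilt_left = (-0.5, 0.0, 1.0)    # surface tilted one way
tilt_right = (0.5, 0.0, 1.0)    # surface tilted the opposite way

b_left = lambertian_brightness(tilt_left, light)
b_right = lambertian_brightness(tilt_right, light)
```

Both orientations produce the same brightness, as does any normal on the cone of directions making the same angle with the light source. This ambiguity is what the smoothness and constant-reflectance assumptions are meant to resolve in the single-image case.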

An  improvement  can  be  achieved  by  using  two 
images  of  the  surface.  Combining  information  from  the 
two  images  makes  it  easier  to  solve  the  minimization 
problem.  In  this  case  one  does  not  need  to  consider  the 
whole  image  at  once  during  the  minimization,  as  in  the 
previous  case,  because  the  image  can  be  decoupled  into 
narrow  strips  along  epipolar  lines.  Only  points  along  a 
pair  of  such  lines  need  be  considered  at  the  same  time. 
However,  to  be  able  to  take  advantage  of  the  two  views, 
one  must  first  find  the  correct  correspondence  between 
the  two  images.  One  can  distinguish  between  two 
methods  in  using  more  than  one  image: 
a)  The  two  cameras  are  very  close  to  each  other.  In  this 
case  it  is  easy  to  establish  a  correspondence  between 
points in the two images. Moreover, the equations one
needs to solve are linear, because the small distance
allows use of a first-order Taylor expansion of the various
functions involved. However, the small baseline




Table 1

| Problem | Passive Observer | Active Observer |
|---|---|---|
| Shape from shading | Ill-posed problem. Needs to be regularized. Even then, unique solution is not guaranteed because of nonlinearity. | Well-posed problem. Unique solution. Linear equation used. Stability. |
| Shape from contour | Ill-posed problem. Has not been regularized up to now in the Tichonov sense. Solvable under restrictive assumptions. | Well-posed problem. Unique solution for both monocular or binocular observer. |
| Shape from texture | Ill-posed problem. Needs some assumption about the texture. | Well-posed problem. No assumption required. |
| Structure from motion | Well-posed but unstable. Nonlinear constraints. | Well-posed and stable. Quadratic constraints, simple solution methods, stability. |
| Optic flow (area based) | Ill-posed. Needs to be regularized. The introduced smoothness might produce erroneous results. | Well-posed problem. Unique solution. Might be unstable. |

between  the  cameras  severely  limits  the  accuracy  of 
the  method. 

b)  The  cameras  are  far  apart.  This  leads  to  more  accu¬ 
rate  results,  provided  one  can  solve  the  correspon¬ 
dence  problem.  This  problem  has  proven  to  be  very 
difficult,  and  the  techniques  that  deal  with  it  are  far 
from satisfactory (for details see, e.g., Horn (1986)).

The  AV  (Active  Vision)  paradigm  has  the  advan¬ 
tages  of  both  methods,  without  their  shortcomings.  First, 
having  multiple  viewpoints  can  solve  the  correspondence 
problem.  In  fact,  it  can  be  shown  that  three  cameras  are 
enough  to  resolve  most  of  the  correspondence  ambiguities 
in  the  Lambertian  case.  The  stability  and  reliability  also 
increase.  But  multiple  viewpoints  are  not  enough  by 
themselves  to  make  a  method  work.  This  is  because  we 
again run up against the problem of non-linear optimization,
which, with so many variables, rarely converges to
the  global  minimum  without  a  very  good  initial  guess. 
The  key  to  the  success  of  AV  is  the  fusion  of  the  long- 
and  short-baseline  methods.  We  have  available  images 
taken  at  short  intervals,  as  well  as  images  separated  by 
long  intervals,  and  we  can  use  information  from  both. 

The  AV  method  can  thus  proceed  in  two  stages: 

1)  A  short  baseline  stage,  in  which  a  succession  of 
frames  taken  at  short  distances  apart  is  examined. 
This  stage,  being  linear,  is  easily  solvable,  thereby 
providing  initial  estimates  for  the  depths  and  surface 
normals. 

2)  A  long  baseline  stage.  Now  that  an  estimate  exists, 
it  can  be  used  in  several  ways,  as  we  shall  see.  In  the 
Lambertian case we use it to establish correspondence
between points seen from far-away viewpoints,
while  in  the  non-Lambertian  case  it  is  the  initial 
guess  in  a  non-linear  optimization  procedure. 

We  shall  show  that  with  this  method  we  can  recover 
the  geometry  of  the  visible  object  at  each  individual 
point  independently.  We  do  not  need  the  assumption 
mentioned  before,  of  global  optimization  (maximum 
smoothness)  of  the  whole  object.  Moreover,  it  turns  out 
that  in  spite  of  the  greater  amount  of  data,  our  task  is 
much easier than in the previous methods, since the
recovery process can be done for each point separately,
rather than having to deal with the image as a whole. All
this is done in a stable and noise-resistant way.

2.2. Geometrical Preliminaries

We work in perspective projection. For simplicity we
assume that the camera moves with its optical axis
remaining parallel to the $z$ axis, and the lens is in the
$x,y$ plane. (Rotations will not get in the way of the basic
principles.) The coordinates of the camera’s lens center,
$x_c, y_c$, change with a known motion, causing changes
in the brightness of the image points. The coordinates of
the image points are measured in the fixed coordinate
system, regardless of the camera’s position, and are
denoted by $x_i, y_i$ (with $z_i = -1$, i.e. a focal length of 1).
The coordinates of a point on the real object are denoted
by $x_o, y_o, z_o$. One has the relation between the object,
camera and image coordinates:


$$
x_i - x_c = -\frac{x_o - x_c}{z_o}, \qquad
y_i - y_c = -\frac{y_o - y_c}{z_o} \tag{2.1}
$$
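As a minimal sketch of the perspective relation, assuming the lens center lies at depth 0 and the image plane at depth -1 as stated above, the projection $x_i = x_c - (x_o - x_c)/z_o$ (and likewise for $y$) and the depth implied by two views can be written as follows. The function names `project` and `depth_from_two_views` are illustrative and do not appear in the original.

```python
def project(obj, cam):
    """Image coordinates (x_i, y_i), in the fixed frame, of the object point
    obj = (x_o, y_o, z_o) seen through a lens centered at cam = (x_c, y_c)."""
    x_o, y_o, z_o = obj
    x_c, y_c = cam
    return (x_c - (x_o - x_c) / z_o, y_c - (y_o - y_c) / z_o)


def depth_from_two_views(x_i1, x_c1, x_i2, x_c2):
    """Depth z_o of one object point from its x image coordinate in two views.

    Subtracting the projection relation at the two camera positions gives
    (x_i2 - x_i1) - (x_c2 - x_c1) = (x_c2 - x_c1) / z_o.
    """
    baseline = x_c2 - x_c1
    shift = x_i2 - x_i1              # image displacement in the fixed frame
    return baseline / (shift - baseline)


if __name__ == "__main__":
    point = (1.0, 2.0, 4.0)
    x1, _ = project(point, (0.0, 0.0))
    x2, _ = project(point, (2.0, 0.0))
    print(depth_from_two_views(x1, 0.0, x2, 2.0))
```

The denominator `shift - baseline` equals `baseline / z_o`, so for a short baseline it is a small difference of measured quantities, which is one way to see why short-baseline depth estimates are sensitive to measurement noise.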

A fixed camera forms a brightness function $E(x_i, y_i)$
in the image plane. This function changes when the camera
moves (i.e. when $x_c, y_c$ change), and the brightness
now depends on four variables: $E(x_i, y_i; x_c, y_c)$. There are
two possible ways to represent the change. (1) Measure
the change at every point $x_i, y_i$ of the image, i.e. find the
partial derivatives of $E$ with respect to $x_i, y_i$ and also
with respect to $x_c, y_c$. This leads to optical flow-like
methods.  In  the  fluid  mechanical  analogy,  this  is  a 
representation  of  the  flow  in  the  Euler  coordinate  system. 
(2)  Follow  a  point  with  a  given  brightness  along  succes¬ 
sive  frames.  This  is  analogous  to  following  a  particular 
element  of  the  fluid  along  its  path  of  motion  and  record¬ 
ing  this  element’s  coordinates  and  their  derivatives.  This 
is  known  as  the  Lagrangian  system.  The  two  methods  are 
of  course  mathematically  equivalent,  and  the  choice 
between  them  depends  on  convenience  in  a  particular 
situation.  For  a  Lambertian  surface,  the  Lagrangian  sys¬ 
tem  has  an  obvious  advantage,  because  an  object  point 
projects  into  image  points  of  equal  brightness  in  all 
images.  Thus,  following  a  point  of  a  given  brightness 
along  successive  frames  means  following  the  same  object 
point.  It  is  also  preferable  in  the  non-Lambertian  case,  in 
a  modified  form,  as  we  shall  see. 
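The link between the two representations can be made explicit with the chain rule. The identity below is a standard consequence of the constant-brightness property of a tracked Lambertian point; it is not written out in the original, and $x_i(x_c, y_c)$, $y_i(x_c, y_c)$ denote the image position of one tracked object point as the camera moves.

```latex
% Lambertian case: the brightness of a tracked object point does not change
% as the camera coordinate x_c varies (the same holds for y_c).
\frac{d}{dx_c}\, E\bigl(x_i(x_c, y_c),\, y_i(x_c, y_c);\, x_c, y_c\bigr)
  = \frac{\partial E}{\partial x_i}\,\frac{\partial x_i}{\partial x_c}
  + \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_c}
  + \frac{\partial E}{\partial x_c}
  = 0
```

The partial derivatives of $E$ are the Eulerian quantities measured at a fixed image point, while $\partial x_i/\partial x_c$ and $\partial y_i/\partial x_c$ are the Lagrangian quantities describing the motion of the tracked point.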

When  we  have  two  viewpoints,  with  either  a  short  or 
long  baseline,  two  matching  systems  of  epipolar  lines  are 
created,  one  for  each  image.  Unlike  the  single  image 
method,  in  which  the  image  has  to  be  treated  as  a  whole, 
the  two  camera  method  makes  it  possible  to  decouple  the 
problem into reconstruction of shape along epipolar lines.
In more detail, consider a plane containing the two lens
center points of the cameras. It intersects the two image
planes in straight lines, namely two matching epipolar
lines. A point on the epipolar line in one image
corresponds to a point somewhere on the matching epipolar
line on the other image (belonging to the same plane);
thus  we  only  have  to  find  the  correspondence  of  points 
along  epipolar  lines,  which  can  offer  a  significant 
simplification. 

In  the  following,  we  treat  the  Lambertian  and  non- 
Lambertian  cases  differently.  The  isotropy  of  the  Lamber¬ 
tian  reflectance  function  presents  us  with  both  a 
simplification  and  a  difficulty.  The  simplification  was 
mentioned  above,  namely  the  constant  brightness  in  all 
images  of  a  particular  object  point.  The  difficulty  is  that 
the  parameters  that  influence  the  brightness,  such  as  the 
surface  orientation  and  the  light  direction,  cannot  be 
recovered  by  changing  the  point  of  view,  as  the 
reflectance  function  is  independent  of  the  point  of  view. 
To  recover  the  orientation,  we  are  forced  to  use  spatial 
derivatives  of  the  brightness.  Happily,  this  makes  the 
problem  linear  even  for  the  long  baseline  step  (at  least  in 
our  particular  geometry). 

We will now make the above arguments more formal.
The image brightness is governed by the “image
irradiance equation”, which is easily generalized to our
case:

$$
E(x_i, y_i; x_c, y_c) = R(\vec{n}, \vec{v}, \vec{s}) \tag{2.2}
$$

The left-hand side is measured from the image at each
camera location $x_c, y_c$. The right-hand side contains our
assumptions about the light reflectance properties of the
surface. This reflectance depends, in general, on the surface
orientation $\vec{n}$, on the viewing direction $\vec{v}$, and on a
vector containing a finite number of parameters, $\vec{s}$, which
represents the light distribution and the surface intrinsic
properties. The variables in $R$ were separated in this way
because $\vec{n}$ is the unknown that we are mainly interested
in, and $\vec{v}$ is a parameter under our control, as the camera
moves and changes the viewing direction. ($\vec{v}$ is the direction
of the known vector $(x_i, y_i) - (x_c, y_c)$.) In view of the
above, recovering shape from shading amounts to solving
this image irradiance equation (for $\vec{n}$).

If the correspondence problem were solved, we could
use the above equation for the same object point in
different views, i.e. different $\vec{v}$'s. We thus obtain a set of
equations for the unknowns in $R$, in particular $\vec{n}$.
Whether this set of equations is degenerate depends on the
particular $R$. For a typical $R$, $\vec{n}$ and $\vec{v}$ are coupled in a
term of the form $\vec{n} \cdot \vec{v}$, so the equations are not degenerate,
at least with respect to finding $\vec{n}$. For a Lambertian
surface, however, the reflectance is independent of $\vec{v}$,
and $\vec{n}$ cannot be found in this way.

2.3.  Lambertian  Surfaces 

The Lambertian case is special in that the shape
reconstruction process can be decoupled not only along
epipolar lines but also along isophotes. That is, when a
contour of equal brightness moves in the image (with the
movement of the camera), the new contour corresponds
to the same set of object points as the old contour. This
is because a Lambertian surface element is seen with the
same brightness from every point of view. Thus, it is convenient
to parametrize the image with a set of coordinates
consisting of the epipolar lines and the isophotes.
One can assign some labeling to the isophotes, which we
denote by $\alpha$, and a labeling to the epipolar lines, denoted
by $\beta$. The particular labeling mapping is immaterial, as
long as it is well behaved and increases monotonically
(with respect to some spatial ordering of these coordinate
lines). Such a well-behaved mapping is always possible at
least in some neighborhood, which is what we need for
taking derivatives. The coordinate lines (epipolar lines
and isophotes) move and change their shape as the camera
moves, but they keep their labels $\alpha, \beta$. This is in
accordance with the Lagrangian coordinate representation.
The key advantage of the scheme is that an image
point labeled $(\alpha, \beta)$ always corresponds to the same object
point. The correspondence problem is then essentially
solved. We can now attach the labels $(\alpha, \beta)$ to the object
point $(x_o, y_o, z_o)$ also. (Two views are needed to create
epipolar lines. We can take one view as fixed, say at the
beginning of the movement, while the other moves. We
will not need the image from the fixed view. It only
serves to create consistent systems of epipolar lines.)
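In a sampled image, locating the point that carries a given isophote label on a given epipolar line amounts to finding where the brightness along that line crosses the isophote value. A minimal sketch, assuming the brightness along one epipolar line is available as a list of samples and that linear interpolation between neighboring samples is adequate (the helper names `isophote_crossings` and `track` are hypothetical):

```python
def isophote_crossings(samples, alpha):
    """Sub-sample positions along one epipolar line where the brightness
    equals alpha, found by linear interpolation between neighbors."""
    hits = []
    for k in range(len(samples) - 1):
        a, b = samples[k], samples[k + 1]
        if a == alpha:
            hits.append(float(k))
        elif (a - alpha) * (b - alpha) < 0:      # strict sign change
            hits.append(k + (alpha - a) / (b - a))
    if samples and samples[-1] == alpha:
        hits.append(float(len(samples) - 1))
    return hits


def track(samples, alpha, predicted):
    """Pick the crossing closest to a predicted position, for example the
    position of the same label in the previous frame. Returns None when
    the isophote does not cross this line."""
    hits = isophote_crossings(samples, alpha)
    if not hits:
        return None
    return min(hits, key=lambda p: abs(p - predicted))


if __name__ == "__main__":
    line = [0.1, 0.3, 0.6, 0.4, 0.2]
    print(isophote_crossings(line, 0.5))
    print(track(line, 0.5, predicted=2.4))
```

When several crossings exist on one epipolar line, the label alone is ambiguous; the prediction passed to `track` stands in for whatever prior estimate of the point's position is available.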




By measuring the position in the image of the point
labeled $(\alpha, \beta)$ as the camera moves, i.e. the functions
$x_i(\alpha, \beta, x_c, y_c)$, $y_i(\alpha, \beta, x_c, y_c)$ and their derivatives, one can
infer the depth and normal at the corresponding object
point. As we shall see, the derivatives with respect to
$(x_c, y_c)$ are not needed except to find an initial guess, but
the derivatives with respect to $\alpha, \beta$ are the ones that
enable us to recover the normal. Differentiating relation
(2.1) we obtain


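$$
\frac{\partial x_i}{\partial \alpha} = -\frac{1}{z_o}\frac{\partial x_o}{\partial \alpha} + \frac{x_o - x_c}{z_o^2}\frac{\partial z_o}{\partial \alpha}, \qquad
\frac{\partial y_i}{\partial \alpha} = -\frac{1}{z_o}\frac{\partial y_o}{\partial \alpha} + \frac{y_o - y_c}{z_o^2}\frac{\partial z_o}{\partial \alpha} \tag{2.3}
$$

$$
\frac{\partial x_i}{\partial \beta} = -\frac{1}{z_o}\frac{\partial x_o}{\partial \beta} + \frac{x_o - x_c}{z_o^2}\frac{\partial z_o}{\partial \beta}, \qquad
\frac{\partial y_i}{\partial \beta} = -\frac{1}{z_o}\frac{\partial y_o}{\partial \beta} + \frac{y_o - y_c}{z_o^2}\frac{\partial z_o}{\partial \beta} \tag{2.4}
$$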


where  the  left-hand  side  quantities  are  measured  from  the 
image,  and  the  right-hand  side  contains  the  unknowns. 

Defining the two-dimensional vectors $\vec{x}_o = (x_o, y_o)$,
$\vec{x}_i = (x_i, y_i)$, and $\vec{x}_c = (x_c, y_c)$, we substitute eqs. (2.1) in
eqs. (2.3), (2.4) to eliminate the $1/z_o^2$ factor and obtain

$$
\frac{\partial \vec{x}_i}{\partial \alpha} = -\frac{1}{z_o}\frac{\partial \vec{x}_o}{\partial \alpha} - \frac{\vec{x}_i - \vec{x}_c}{z_o}\frac{\partial z_o}{\partial \alpha} \tag{2.5}
$$

and a similar equation in $\beta$. Using this equation is the key
idea in recovering the surface orientation in the Lambertian
case. Its intuitive interpretation is that as the camera
moves, the infinitesimal distances $d\alpha, d\beta$ between two
object points do not change, but the corresponding $d\vec{x}_i$,
i.e. the (geometrical) distances between the corresponding
isophotes (or epipolars) in the image, do change. This
change depends on the geometry of the object, as is seen
in the above equation, and thus this geometry can be
recovered.
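The relation between the measured image derivative and the object geometry can be checked numerically. The sketch below assumes a hypothetical surface patch (a tilted plane labeled directly by its own coordinates) and compares a finite-difference estimate of $\partial x_i/\partial\alpha$ with the value predicted from the depth and tangent, $-(1/z_o)\,\partial x_o/\partial\alpha - ((x_i - x_c)/z_o)\,\partial z_o/\partial\alpha$. The surface, the labeling, and the function names are assumptions made for illustration only.

```python
def surface(alpha, beta):
    """Hypothetical surface patch labeled by (alpha, beta): a tilted plane."""
    return (alpha, beta, 5.0 + 0.3 * alpha + 0.1 * beta)


def image_x(alpha, beta, x_c):
    """x image coordinate of the labeled point for a lens centered at x_c."""
    x_o, _, z_o = surface(alpha, beta)
    return x_c - (x_o - x_c) / z_o


def check_derivative_relation(alpha, beta, x_c, h=1e-5):
    """Return (measured, predicted) values of d x_i / d alpha."""
    # "Measured": central finite difference of the image position.
    measured = (image_x(alpha + h, beta, x_c)
                - image_x(alpha - h, beta, x_c)) / (2.0 * h)
    # "Predicted": from the depth and the tangent of the plane along alpha.
    x_o, _, z_o = surface(alpha, beta)
    dx_o, dz_o = 1.0, 0.3
    x_i = image_x(alpha, beta, x_c)
    predicted = -dx_o / z_o - (x_i - x_c) / z_o * dz_o
    return measured, predicted


if __name__ == "__main__":
    for x_c in (1.5, 3.0):
        print(x_c, check_derivative_relation(0.4, -0.2, x_c))
```

The measured derivative differs between the two camera positions while the tangent of the surface stays the same, which is the dependence the text exploits to recover the geometry.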


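For a short baseline, we calculate the change of the
above derivatives caused by the camera’s movement by
differentiating with respect to $\vec{x}_c$:

$$
\frac{\partial^2 (\vec{x}_i)_j}{\partial \alpha\, \partial (\vec{x}_c)_k} = -\left(\frac{\partial (\vec{x}_i)_j}{\partial (\vec{x}_c)_k} - \delta_{jk}\right)\frac{1}{z_o}\frac{\partial z_o}{\partial \alpha} \tag{2.6}
$$

where $\delta_{jk}$ is the Kronecker delta, and the indices $j, k$ each
label one of the two components of the vectors. We can multiply
eqs. (2.1), (2.5), (2.6) by $z_o$ to obtain the linear system of
equations: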


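$$
\vec{x}_o + (\vec{x}_i - \vec{x}_c)\, z_o - \vec{x}_c = 0 \tag{2.7}
$$

$$
\frac{\partial \vec{x}_o}{\partial \alpha} + (\vec{x}_i - \vec{x}_c)\frac{\partial z_o}{\partial \alpha} + \frac{\partial \vec{x}_i}{\partial \alpha}\, z_o = 0 \tag{2.8}
$$

$$
\frac{\partial^2 (\vec{x}_i)_j}{\partial \alpha\, \partial (\vec{x}_c)_k}\, z_o + \left(\frac{\partial (\vec{x}_i)_j}{\partial (\vec{x}_c)_k} - \delta_{jk}\right)\frac{\partial z_o}{\partial \alpha} = 0 \tag{2.9}
$$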

The  solution  can  now  proceed  in  two  steps: 

1) Solve the linear system of six equations (2.7), (2.8), (2.9) for
the six unknowns $\vec{x}_o$, $z_o$, $\partial \vec{x}_o/\partial \alpha$, $\partial z_o/\partial \alpha$. This is an
inhomogeneous system of rank six (in general) and
thus has a unique solution. The coefficients of the
unknowns do not have to come from measurements
at one point. (By measurements we mean the functions
$\vec{x}_i(\alpha, \beta)$ and their derivatives.) Rather, measurements
can be taken from several points along the
path of the point $\vec{x}_i(\alpha, \beta)$ and averaged. As the equations
are linear, the averaged measurements can be
substituted into them so that the system need be inverted
only once. Thus we have obtained a good estimate of
the unknowns.

2) The accuracy of the previous step may not be good,
as we used a second derivative (eq. (2.9)). This
amounts to using a short baseline stereo technique.
For better accuracy we can use the first four equations,
(2.7) and (2.8), at two far away points (or
camera locations) $\vec{x}_c$ and $\vec{x}_c'$. This will result in eight
equations, two of which are superfluous. This step
needs the establishment of a correspondence between
the two views. Since we already know the location of
the object points roughly from the first step, applied
to both images with $\vec{x}_c$ and $\vec{x}_c'$, any correspondence
ambiguities are resolved. Without the first step,
correspondence ambiguities could arise from having
several points with the same brightness along one
epipolar. (Alternatively to the first step, we could
simply trace the point $(\alpha, \beta)$ from frame to frame,
but then we may need continuous, uninterrupted
tracing.) Now the longer baseline will significantly
improve the accuracy and stability. Equation (2.9),
involving the derivatives with respect to $\vec{x}_c$, is no
longer part of the system. Since we still have a linear
system of equations, we can again make use of averaged
measurements with different base points, and
invert a system of equations with the coefficients calculated
from the averaged measurements.
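The long-baseline step can be sketched as follows. This is only an illustration under stated assumptions: the two camera positions differ in their x coordinate, the correspondence is already resolved, and the image derivatives with respect to the isophote label have been measured in both views. Instead of inverting the full overdetermined system, the sketch uses six of the eight equations: the x components of (2.7) and (2.8) in both views, and the y components in the first view. The function names and the dictionary layout are hypothetical, and `synthetic_view` exists only to generate noise-free test data.

```python
def synthetic_view(obj, tangent, cam):
    """Noise-free measurements for one view (test data only)."""
    x_o, y_o, z_o = obj
    dx_o, dy_o, dz_o = tangent
    x_i = cam[0] - (x_o - cam[0]) / z_o
    y_i = cam[1] - (y_o - cam[1]) / z_o
    p = -dx_o / z_o - (x_i - cam[0]) / z_o * dz_o   # d x_i / d alpha
    q = -dy_o / z_o - (y_i - cam[1]) / z_o * dz_o   # d y_i / d alpha
    return {"cam": cam, "img": (x_i, y_i), "d_img": (p, q)}


def recover_point_and_tangent(view1, view2):
    """Recover (x_o, y_o, z_o) and its alpha-tangent from two distant views."""
    (xc1, yc1), (xc2, _) = view1["cam"], view2["cam"]
    u1 = view1["img"][0] - xc1       # x_i - x_c in view 1
    w1 = view1["img"][1] - yc1       # y_i - y_c in view 1
    u2 = view2["img"][0] - xc2       # x_i - x_c in view 2
    p1, q1 = view1["d_img"]
    p2 = view2["d_img"][0]

    # x component of (2.7) in both views: x_o + u * z_o = x_c
    z_o = (xc2 - xc1) / (u2 - u1)
    x_o = xc1 - u1 * z_o
    y_o = yc1 - w1 * z_o             # y component of (2.7), view 1

    # x component of (2.8) in both views: dx_o + u * dz_o + p * z_o = 0
    dz_o = -z_o * (p2 - p1) / (u2 - u1)
    dx_o = -u1 * dz_o - p1 * z_o
    dy_o = -w1 * dz_o - q1 * z_o     # y component of (2.8), view 1
    return (x_o, y_o, z_o), (dx_o, dy_o, dz_o)


if __name__ == "__main__":
    truth = ((1.0, 2.0, 4.0), (1.0, 0.5, 0.3))
    v1 = synthetic_view(truth[0], truth[1], (0.0, 0.0))
    v2 = synthetic_view(truth[0], truth[1], (2.0, 0.0))
    print(recover_point_and_tangent(v1, v2))
```

With noisy measurements one would presumably keep all eight equations and solve them in a least-squares sense, which is closer in spirit to the averaging described in the text; the closed form above is chosen only to keep the example short.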

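The above steps will yield the location of the object
point $x_o, y_o, z_o$ and one tangent on the surface, namely
$\left(\frac{\partial x_o}{\partial \alpha}, \frac{\partial y_o}{\partial \alpha}, \frac{\partial z_o}{\partial \alpha}\right)$.
A similar procedure using the $\beta$ parameter
will give another tangent. Once the tangents of the
surface at $\vec{x}_o, z_o$ are known, the normal can immediately
be calculated by their vector product

$$
\vec{n} = \frac{1}{A}\left(\frac{\partial x_o}{\partial \alpha}, \frac{\partial y_o}{\partial \alpha}, \frac{\partial z_o}{\partial \alpha}\right) \times \left(\frac{\partial x_o}{\partial \beta}, \frac{\partial y_o}{\partial \beta}, \frac{\partial z_o}{\partial \beta}\right)
$$

where $A$ is the scalar product of the above tangent vectors.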
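Computing the normal from the two recovered tangents is a single vector product. The sketch below normalizes the result to unit length, which is an assumption about the intended scale factor; the function name is hypothetical.

```python
import math


def surface_normal(t_alpha, t_beta):
    """Unit normal from two tangent vectors, each given as a 3-tuple
    (d x_o, d y_o, d z_o) with respect to one of the two labels."""
    ax, ay, az = t_alpha
    bx, by, bz = t_beta
    n = (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("tangents are parallel; the normal is undefined")
    return tuple(c / length for c in n)


if __name__ == "__main__":
    print(surface_normal((1.0, 0.0, 0.3), (0.0, 1.0, 0.1)))
```

The sign of the result depends on the order in which the two tangents are supplied, so a convention (for example, requiring the normal to face the camera) has to be chosen separately.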

A  few  points  are  worth  noting: 

1)  At  no  point  did  we  need  to  know  either  the  albedo 
(the  surface  reflectance)  or  the  light  characteristics. 
They  can  both  vary  from  point  to  point  on  the  sur¬ 
face  without  affecting  the  calculation.  Thus,  when  a 
change in the brightness is detected, we are able to
tell whether it results from a change in the geometry
of  the  surface  or  from  the  reflectance  or  lighting  (and 
can  calculate  the  local  geometry).  This  is  unlike 
other  theories. 

2)  The  surface  is  not  necessarily  smooth,  as  is  assumed 
in  most  shape  reconstruction  theories.  If  one  of  the 
derivatives  used  is  too  large,  we  simply  label  this 
point  as  a  discontinuity  and  apply  our  method  to 
other  points  in  the  neighborhood. 

3)  The  tangents  have  not  been  derived  from  the  depth 
map, which would have been quite inaccurate. They
were independent variables in a system of equations
that determined both the depth and the tangents. It
would be of interest to carry out an error analysis of
the results.

4)  The  equations  for  both  the  short  and  long  baseline 
turned  out  to  be  linear,  which  frees  us  from  the 
recurring  hard  problems  of  non-linear  optimization. 

5) Using the brightness function at several viewpoints
was not enough to recover the surface normal, and
we needed its derivatives $\partial \vec{x}_i/\partial \alpha$, $\partial \vec{x}_i/\partial \beta$. The
infinitesimals $\partial \alpha, \partial \beta$ do not change from one image
to another, but $\partial \vec{x}_i$ changes in a way dependent on
the normal, which can thus be recovered. We will
not need the derivatives in the non-Lambertian case
(except at the first stage).

6)  The  formalism  is  applicable  to  any  contours,  not 
necessarily  isophotes.  Many  contours  that  are 
marked  on  the  object  can  be  detected  on  an  image 
by  means  other  than  changes  in  shading.  For 
instance,  contours  can  be  formed  by  changes  in  tex¬ 
ture  or  color.  If  the  change  is  continuous,  such  as  a 
gradual  color  change,  we  can  draw  contours  on 
which  some  property  such  as  color  remains  constant. 
We  can  then  label  the  contours  in  the  same  way  we 
labeled  the  isophotes  and  apply  the  above  formalism 
without  change.  We  can  thus  find  the  position  and 
orientation  of  the  visible  surface  elements.  If  the 
change  that  produced  a  contour  was  a  sharp  discon¬ 
tinuity,  the  contour  is  isolated.  In  this  case  we  can¬ 
not  measure  the  derivative  of  the  contour’s  property 
along  epipolar  lines,  the  way  we  measured  the 
derivative  of  the  isophote’s  brightness.  We  can  still 
measure  the  other  derivative,  i.e.  of  the  epipolar  line 
along the contour. (Formally, we can differentiate
with respect to $\beta$, but not $\alpha$.) This is enough to find
the position and direction of the contour element (by
solving eqs. (2.7), (2.8), (2.9) with $\beta$). This line element lies on
a  visible  surface  element,  and  its  direction  is  the 
direction  of  one  tangent  to  that  surface.  In  this  case, 
then,  without  further  information,  the  surface 
element’s  orientation  can  be  determined  only  up  to 
rotation  around  the  contour  element. 

2.4.  Non-Lambertian  Surfaces 

When the reflectance function has a non-isotropic
component, the reconstruction becomes more difficult.
The isophotes at different viewpoints no longer correspond
to the same object point. Thus the simple Lagrangian formalism
described above has to be modified. Additionally,
more unknowns are added to the problem, as there are
more parameters in the reflectance function, including the
relative strengths of the nonisotropic components. (They
enter the vector $\vec{s}$ in eq. (2.2).) This reflectance function
is in general non-linear. We think that the Lagrange formalism,
namely following points on the changing image,
is  still  preferable  to  the  Euler  formalism  used  in  most 
optical  flow  theories  (measuring  changes  at  a  fixed  point 
on  the  image)  because  it  enables  us  to  deal  with  an  object 
point  (and  its  neighborhood)  separately  from  other 
points,  while  a  fixed  point  on  the  image  corresponds  to 
different  object  points  during  the  motion  of  the  camera. 

Because simply following the isophotes is not useful
as in the Lambertian case, the image brightness $E$ has to
be taken into account explicitly. It is useful to work in a
five-dimensional vector space $V$, whose components are
$\vec{x}_i, E, \vec{x}_c$. In the fluid mechanical analogy, the subspace
spanned by the first three components, $\vec{x}_i, E$, is somewhat
analogous to a “phase space”, while $\vec{x}_c$ represents
temporal dimensions. Thus, for one viewpoint (fixed $\vec{x}_c$),
the image is a 2-D surface in the above 3-D subspace. It
is simply the surface described by the brightness function
$E(x_i, y_i)$. As the camera moves along some (known) path
$\vec{x}_c(\gamma)$ in the $\vec{x}_c$ dimensions, it creates a one-parameter
family of such 2-D surfaces, so that we have a 3-D surface
in the 5-D space. This surface is our data, measured from
the images. We shall call it the “brightness surface”, to
distinguish it from the visible surface.

Consider again the image irradiance equation

$$
E(x_i, y_i; x_c, y_c) = R(\vec{n}, \vec{v}, \vec{s}) \tag{2.2}
$$

where the viewing direction $\vec{v}$ depends on $\vec{x}_i, \vec{x}_c$:

$$
\vec{v} = \frac{\vec{x}_i - \vec{x}_c}{|\vec{x}_i - \vec{x}_c|}
$$

This equation has to be solved for each object point.
We assume that the functional form of $R$ is the same at
each point, with different parameters. The 3-D
brightness surface appears on the left-hand side of this
equation. To solve the equation, we need to represent
the right-hand side too. This can be done by representing
each visible surface element (around an object point) as a
trajectory in the 5-D space. For such an element, all the
unknowns $\vec{n}, \vec{x}_o, z_o, \vec{s}$ are fixed. The camera moves along
the path $\vec{x}_c(\gamma)$, causing a simultaneous move $\vec{x}_i(\gamma)$ in the
image coordinates, in accordance with the perspective
geometry equation, eq. (2.1). The light reflected by the element
in the viewing direction also changes. Using the
reflectance function $R$, we can compute $R(\gamma)$. Now we
can plot a trajectory in the 5-D space, with $R$ being plotted
in the $E$ dimension. We obtain the curve

$$
\vec{r}(\gamma) = \bigl(\vec{x}_i(\gamma),\, \vec{x}_c(\gamma),\, R(\gamma)\bigr)
$$

In summary, we have a trajectory $\vec{r}(\gamma)$ for every set of
unknown parameters pertaining to one visible surface element.



If  such  a  surface  element  lies  on  our  visible  object, 
then  its  trajectory  in  the  5-D  space  V  must  lie  on  the  3-D 
brightness surface (representing $E$ in that space). This
is because of the equality $E = R$, i.e. the reflectance calculated
for the surface element has to match the measured
image brightness, in every image observed.

Since  the  visible  object  is  made  of  such  elements,  the 
3-D  brightness  surface  created  by  the  object  in  the  5-D 
space  is  made  up  of  such  individual  trajectories.  We  can 
distinguish  between  the  different  trajectories  by  their 
starting  point.  Looking  at  the  first  image  on  the 
camera’s path, each point $\vec{x}_i$ belongs to one surface element,
and can be used to label the corresponding trajectory.

The solution of the image irradiance equation
$E = R$ for each surface element is now reduced to finding
a set of unknowns $\vec{n}, \vec{x}_o, z_o, \vec{s}$ for which the element’s trajectory
in the 5-D space lies entirely on the brightness
surface (and has a given starting point). Correspondence
is  not  an  issue  here.  A  trajectory  in  V  immediately 
defines  a  correspondence  between  successive  images. 
Stated  differently,  the  correspondence  problem  has  been 
merged  into  the  trajectory  matching  problem. 

So,  by  following  trajectories  that  are  generated  by 
single  object  points,  we  have  been  able  to  decouple  the 
reconstruction  problem  and  solve  separately  for  small  sets 
of parameters, belonging to each object point, rather than
dealing  with  the  image  as  a  whole.  The  usual  theories,  in 
contrast,  lead  to  a  large  set  of  coupled  equations  involv¬ 
ing  all  the  image  points.  This  trajectory  following  method 
can  be  associated,  in  a  way,  with  the  Lagrange  system  of 
fluid  mechanics. 

Having  clarified  the  theoretical  vision  aspect  of  the 
problem,  we  have  now  to  deal  with  the  more  practical 
problem  of  fitting  a  trajectory  to  a  surface.  First,  how  do 
we  define  fitting?  One  way  is  to  demand  that  the  equa¬ 
tion  E  =  R  be  satisfied  at  several  points  along  the  trajec¬ 
tory.  Thus  we  obtain  a  set  of  equations,  one  from  each 
such  point,  for  the  set  of  unknowns.  If  the  number  of 
points  is  at  least  as  large  as  the  number  of  unknowns,  a 
solution  can  be  found,  in  a  generic  case.  If  it  is  larger,  the 
reliability  improves.  This  is  similar  to  the  situation 
described in Section 2. As noted there, the system is not
degenerate because the geometrical unknowns, $(\vec{n}, \vec{x}_o, z_o)$,
are coupled to the viewing direction $\vec{v}$. The other parameters
in $R$ which have some coupling to the viewing
direction  will  also  be  recovered.  For  instance,  having  a 
single  light  source,  we  may  find  its  direction,  and  find  the 
product  of  the  intensity  with  the  surface  albedo,  but  the 
latter  cannot  be  separated  in  this  method. 

In  practice,  because  of  noise  and  other  inaccuracies, 
it  may  be  impossible  to  find  such  a  solution.  Thus  it  is 
preferable  to  turn  the  problem  into  one  of  optimization. 
Unlike  theories  such  as  regularization,  this  optimization  is 
not  a  result  of  some  additional  assumptions  such  as 
smoothness,  which  are  not  needed  here.  It  is  simply  a  way 
to  make  the  tolerance  limit  of  the  fitting  less  stringent. 


One functional we can optimize is the sum of distances
from each trajectory point to the nearest brightness
surface point. For simplicity, we measure the distance
in the $E$ dimension only, taking the coordinates
$\vec{x}_i, \vec{x}_c$ to be the same for both the trajectory and the surface.
Thus, we seek to minimize the functional

$$
\int_{\vec{r}(\gamma)} \Bigl( R\bigl[\vec{n},\, \vec{v}(\vec{x}_o, z_o, \vec{x}_c),\, \vec{s}\,\bigr] - E(\vec{x}_i, \vec{x}_c) \Bigr)^2 d\gamma
$$

over all possible values of the unknowns $\vec{x}_o, z_o, \vec{n}, \vec{s}$, with
the constraint of passing through a given starting point.
The unknowns enter the problem through both their
explicit appearance in the integrand and their determination
of the trajectory $\vec{r}(\gamma)$. (Recall that $\vec{x}_i$ is determined
by eq. (2.1), which contains the unknown position $\vec{x}_o, z_o$
of the object point.)
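A discretized version of this trajectory-fitting cost can be sketched as follows. Several things here are assumptions made only so that the example runs: the reflectance model (an isotropic term plus a term proportional to the cosine between normal and viewing direction), the use of the 3-D vector from the lens center to the image point as the viewing direction, the replacement of the integral by a sum over sampled camera positions, and all function names. The measured brightness surface is passed in as a callable.

```python
import math


def project(obj, cam):
    """Image position of the object point for a lens centered at cam."""
    x_o, y_o, z_o = obj
    return (cam[0] - (x_o - cam[0]) / z_o, cam[1] - (y_o - cam[1]) / z_o)


def view_dir(img, cam):
    """Unit vector from the lens center (depth 0) to the image point
    (depth -1); it points from the object toward the camera."""
    dx, dy = img[0] - cam[0], img[1] - cam[1]
    norm = math.sqrt(dx * dx + dy * dy + 1.0)
    return (dx / norm, dy / norm, -1.0 / norm)


def reflectance(n, v, s):
    """Placeholder model: s = (a, b), isotropic part a plus a
    view-dependent part b * max(0, n . v)."""
    a, b = s
    return a + b * max(0.0, sum(ni * vi for ni, vi in zip(n, v)))


def fit_cost(obj, n, s, cam_path, brightness):
    """Sum of squared differences between predicted reflectance and measured
    brightness along the trajectory generated by one candidate element."""
    total = 0.0
    for cam in cam_path:
        img = project(obj, cam)          # trajectory point for this camera
        predicted = reflectance(n, view_dir(img, cam), s)
        total += (predicted - brightness(img, cam)) ** 2
    return total
```

In use, `fit_cost` would be handed to a general-purpose minimizer over the candidate position, normal, and reflectance parameters, started from the short-baseline estimate; the cost is zero when the candidate trajectory lies on the brightness surface at every sampled camera position.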

Although  the  number  of  unknowns  is  small,  we  may 
still  face  difficulties  resulting  from  the  non-linear  nature 
of  the  problem.  Our  strategy  to  deal  with  that  is  similar 
to  the  one  we  used  in  the  Lambertian  case.  In  step  1,  we 
use  a  very  short  trajectory,  and  linearize  our  expressions, 
using  a  Taylor  expansion  of  R.  Alternatively,  higher 
derivatives can be used instead of several close points along
the  trajectory.  The  solution  of  the  linear  equations  pro¬ 
vides  a  good  initial  guess  for  step  2. 

Step  2  is  the  non-linear  optimization  described 
above.  The  small  number  of  unknowns  and  the  good  ini¬ 
tial  guess  make  the  task  quite  easy. 

As  in  the  Lambertian  case,  no  use  has  been  made  of 
a  smoothness  assumption.  Furthermore,  the  position  in 
space,  as  well  as  the  orientation,  have  been  recovered  for 
each  object  point  separately.  The  other  parameters  in  the 
reflectance  function  which  are  coupled  to  the  viewing 
direction  are  also  recovered.  The  multiple  viewpoints 
make  the  result  reliable  and  stable.  For  the  Lambertian 
case this method has a certain degeneracy, as $\vec{n}$ and $\vec{s}$
cannot  be  separated  solely  by  moving  the  camera,  as 
noted  before. 

2.5.  Summary  and  Conclusions 

In  evaluating  the  AV  paradigm  for  recovering  shape 
from  shading,  two  major  benefits  arise: 

1)  More  viewpoints  give  us  more  information,  and  allow 
us  to  dispose  of  the  restrictive  assumptions  used  in 
previous  research  and  increase  the  stability  with 
respect  to  noise  and  other  errors. 

2)  One  would  think  that  with  more  images  the  task  of 
processing the given information will be harder than
for  one  or  two  viewpoints.  In  fact,  just  the  opposite 
has  happened,  because  we  have  been  able  to  decou¬ 
ple  the  handling  of  the  individual  object  points,  and 
solve  the  problem  in  small,  separate  parts. 

These  advantages  together  allow  us  to  infer  the 
geometrical parameters as well as reflectance function
parameters at each point individually. This means that
we recover the true shape of the object, rather than a
smoothed and “optimized” version of it, as other theories
do. This also resolves the perennial problem of whether a
change in the image brightness is caused by a change in
the  object’s  geometry,  or  by  other  light  reflectance  fac¬ 
tors. 

3.  SHAPE  FROM  CONTOUR 

Here  we  study  the  problem  of  the  detection  of  shape 
from  contour  by  an  active  observer,  and  we  compare  the 
performance  of  this  new  active  scheme  with  the  passive 
approach. We consider the contour to be planar. Work on
nonplanar contours can be found in (Ito and Aloimonos,
1987a).

3.1.  Introduction 

The  human  perceiver  is  able  to  derive  enormous 
amounts  of  information  from  the  contours  in  a  scene.  As 
part  of  this  capacity,  we  are  able  to  use  the  shapes  of 
image  contours  (as  they  are  seen  by  both  eyes)  to  infer 
the  shapes  and  dispositions  in  space  of  the  surfaces  they 
lie on, as well as their motion. The interpretation of contours
by a binocular observer involves several subproblems
(following Witkin, 1981):

a)  Locating  contours  in  the  images 

If  contours  are  to  be  used  to  infer  anything,  they 
must  be  found.  The  human  perceiver  has  little  difficulty 
deciding  what  is  and  is  not  a  contour,  yet  the  automatic 
detection  of  edges  has  proved  very  difficult.  Perhaps  this 
fact  should  not  be  surprising;  the  contours  that  we  see  in 
natural  images  usually  correspond  to  definite  physical 
events,  such  as  shadows,  depth  discontinuities,  color 
differences  and  the  like.  Our  ability  to  detect  these 
events  may  say  more  about  their  significance  for  image 
interpretation  than  about  their  ease  of  detection.  Why 
should  we  expect  events  that  have  simple  descriptions  in 
terms  of  the  structure  of  the  scene  to  have  simple 
descriptions in terms of the image intensity as well? If
the physical significance of contours is taken as their primary
feature, then at least we know what is being
detected, even if we don’t know how. Recent
research (Nalwa, 1985), however, shows that contour detection
is reasonably well advanced: we can detect the contours in
an image fairly well, even if there are some inaccuracies.

b)  Labeling contours (i.e. distinguishing contours which
are due to different physical events)

If  contours  correspond  to  different  physical  events, 
then  an  essential  component  of  their  interpretation  must 
be  to  decide  which  contours  denote  which  event,  since 
each  kind  of  contour  imparts  a  different  meaning.  Recent 
work  has  shown  that  strong  structural  constraints  can  be 
applied  to  distinguish  one  kind  of  contour  from  another. 

c)  Interpreting  contours 

Even after contours have been found and labeled,
not much is known about the physical structure of the
scene. It is clear that contours play an important role in
the human perceiver’s ability to decide how things are
shaped and where they are, apart from the application of
specific “higher level” knowledge to objects of known
shape.

In  this  section  we  study  this  problem  of  contour 
interpretation. 

3.2.  The  Passive  Approach  (Previous  Work) 

The recovery of three-dimensional shape and surface
orientation from a two-dimensional contour is a fundamental
process in any visual system. Recently, a number
of methods have been proposed for computing shape from
contour. For the most part, previous passive techniques
have concentrated on trying to identify a few simple, general
constraints and assumptions that are consistent with
the nature of all possible objects and imaging geometries
in order to recover a single “best” interpretation from
among the many that are possible for a given image. For
example, Kanade (1981) defines shape constraints in
terms of image space regularities such as parallel lines
and skew symmetries under orthographic projection.
Witkin (1981) looks for the most uniform distribution of
tangents to a contour over a set of possible inverse projections
in object space under orthography. Similarly,
Brady and Yuille (1984) search for the most compact
shape (using the measure of area over perimeter squared)
in the object space of inverse projected planar contours.

Rather than attempting to maximize some general
shape-based evaluation function over the space of possible
inverse projective transforms of a given image contour,
and adhering to our framework of seeking unique
solutions without employing any restrictive assumptions
or heuristics, we propose to find a unique solution by
using an active approach. The motivation is that it can be
easily proved that one image of a planar contour (under
orthography or perspective) admits infinitely many
interpretations of the structure of the world plane on which
the contour lies, if no other information is known. Finally,
the need for a unique solution, which is guaranteed in our
approach, arises also from the fact that there exist many
real world counterexamples to the evaluation functions that
have been developed to date. For example, Kanade’s and
Witkin’s measures incorrectly estimate surface orientation
for regular shapes such as ellipses (which are often
interpreted as slanted circles). Brady’s compactness measure
does not correctly interpret non-compact figures such as a
rectangle, since it will compute the rectangle to be a rotated
square (e.g. if we view a rectangular table top, we do not see
it as a rotated square surface, but as a rotated rectangle).
It is worth noting that the equations used in previous
work (the passive approach) are highly nonlinear.

Up to now we have only discussed monocular passive
shape from contour. It would seem that detecting shape
from contour employing a binocular observer might
reduce ambiguity and nonlinearity. It is shown in the
sequel that this is not the case, i.e., even if we employ a
binocular static observer, the problem is still nonlinear if
we want to avoid solving the correspondence problem
between the left and right frames. (We have stated that
our approach will be correspondenceless.) Indeed, consider
a binocular observer imaging a planar contour $C$ with
projections $C_L$ and $C_R$ on the left and right frames
respectively. Let a coordinate system $OXYZ$ be fixed with
respect to the left camera, with the $Z$ axis pointing along
the optical axis. We assume that the image plane $Im_L$ is
perpendicular to the $Z$ axis at the point $(0,0,1)$. Let the
nodal point of the right camera be the point $(d,0,0)$, and
let its image plane $Im_R$ be identical to the previous one.
Consider also a plane $\Pi$ in the world with equation
$Z = pX + qY + c$, which contains the contour $C$, and
consider the perspective images $C_L$ and $C_R$ of the contour
on the left and right image planes respectively.

From now on we will denote the coordinates on the
left and right image planes by $(x_l, y_l)$ and $(x_r, y_r)$
respectively. We assume perspective projection. We can easily
prove that if $S_L$, $S_R$ are the areas enclosed by contours
$C_L$ and $C_R$, and $(A_L, B_L)$, $(A_R, B_R)$ are the centers of
mass of contours $C_L$ and $C_R$, then

\[
\frac{S_L}{S_R} = \frac{1 - A_L p - B_L q}{1 - A_R p - B_R q},
\tag{3.1}
\]

where $(p,q)$ is the gradient of the plane $\Pi$ with equation
$Z = pX + qY + c$ with respect to the left frame, and
where we have assumed (for simplicity and without loss
of generality) that the focal length $f = 1$. For a proof of
(3.1) see Appendix 1. If we want to recover the shape of
the contour in view without any assumptions, and we do not
want to resort to correspondence, what we should do is
connect properties of the left and right images.
Such properties can include area, perimeter, any function
of these two, or other functions of the positions of the
contours. Equation (3.1) is linear in $p$ and $q$ (after
cross-multiplication) and is the only linear
constraint we have been able to find.
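
As an illustration only, the area-ratio relation (3.1) can be checked numerically on synthetic data. The following is a minimal sketch, assuming a polygonal contour, focal length 1, and "center of mass" taken as the area centroid of the enclosed region; the helper names (`polygon_area_centroid`, `project`) and all numeric values are hypothetical and not part of the original work.

```python
import numpy as np

def polygon_area_centroid(pts):
    """Shoelace area and area centroid of a simple polygon given as an N x 2 array."""
    x, y = pts[:, 0], pts[:, 1]
    xn, yn = np.roll(x, -1), np.roll(y, -1)
    cross = x * yn - xn * y
    signed_area = 0.5 * cross.sum()
    cx = ((x + xn) * cross).sum() / (6.0 * signed_area)
    cy = ((y + yn) * cross).sum() / (6.0 * signed_area)
    return abs(signed_area), np.array([cx, cy])

def project(points, nodal_x=0.0):
    """Perspective projection (focal length 1) for a camera with nodal point (nodal_x, 0, 0)."""
    return np.column_stack(((points[:, 0] - nodal_x) / points[:, 2],
                            points[:, 1] / points[:, 2]))

# Assumed plane Z = pX + qY + c and baseline d (illustrative values).
p, q, c, d = 0.3, -0.2, 5.0, 0.5
XY = np.array([[-1.0, -0.5], [1.2, -0.8], [1.5, 0.9], [0.2, 1.4], [-0.9, 0.7]])
world = np.column_stack((XY, p * XY[:, 0] + q * XY[:, 1] + c))

S_L, (A_L, B_L) = polygon_area_centroid(project(world))
S_R, (A_R, B_R) = polygon_area_centroid(project(world, nodal_x=d))

lhs = S_L / S_R
rhs = (1 - A_L * p - B_L * q) / (1 - A_R * p - B_R * q)
print(lhs, rhs)  # the two sides of (3.1)
```

No point-to-point correspondence is used here: only the area and centroid of each image contour enter the relation.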

Another constraint can be extracted from the perimeters
of the two contours. If we calculate the perimeter
of the world contour from each of the two projections, then
these two results should be equal. From this we can get
an additional constraint on $p$, $q$.

To do this, we need to develop the first fundamental
form of the world plane as a function of the retinal
coordinates, in order to be able to compute the length of the
world contour (up to a constant factor, of course). If we
fix a coordinate system $OXYZ$ with the $Z$ axis as the
optical axis and focal length 1, and we consider a plane
$\Pi$: $Z = pX + qY + c$ in the world with a contour $C$ on it,
and we denote by $(x,y)$ the coordinates on the image
plane, then a point $(X,Y,Z)$ in the world planar contour
$C$ is projected onto the point

\[
(x, y) = \left( \frac{X}{Z}, \frac{Y}{Z} \right).
\]

The inverse imaging function, call it $f$, is the function
that maps the image plane onto the world plane; so,
if $(x,y)$ is an image point, the 3-D world point on the
plane $Z = pX + qY + c$ that has $(x,y)$ as its image is
given by

\[
f(x,y) = \left( \frac{cx}{1 - px - qy},\; \frac{cy}{1 - px - qy},\; \frac{c}{1 - px - qy} \right).
\]

The first fundamental form of $f$ [Lipschutz, 1969] is
the quadratic form

\[
E\,dx^2 + 2F\,dx\,dy + G\,dy^2 ,
\]

with

\[
E = f_x \cdot f_x, \qquad F = f_x \cdot f_y \qquad \text{and} \qquad G = f_y \cdot f_y .
\]

After simple calculations we get

\[
E = \frac{c^2}{(1 - px - qy)^4} \left[ (1 - qy)^2 + p^2 y^2 + p^2 \right]
\]

\[
F = \frac{c^2}{(1 - px - qy)^4} \left[ (1 - qy)\,qx + (1 - px)\,py + pq \right]
\]

\[
G = \frac{c^2}{(1 - px - qy)^4} \left[ q^2 x^2 + (1 - px)^2 + q^2 \right]
\]

So, if we consider two points $(x,y)$ and $(x + dx, y + dy)$
on the image plane, then the three-dimensional
distance $dC$ of the corresponding points on the world
plane is given by

\[
dC = \sqrt{E\,dx^2 + 2F\,dx\,dy + G\,dy^2}
\]

Consequently, if we have a contour $C$ on the image
plane, then the 3-D planar contour has length

\[
\int_C \sqrt{E\,dx^2 + 2F\,dx\,dy + G\,dy^2}
\]

Using the above equation we can compute the length of
the world contour (up to a constant factor) from both the
left and right frames, and equate the results. This will
result in a nonlinear equation in the unknowns $p$, $q$.
This equation may be solved, together with the linear
constraint (Aloimonos, 1986), to give the orientation
gradient of the contour. Uniqueness is not guaranteed
theoretically, and the nonlinearity could create instabilities.
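
To make the length computation concrete, here is a minimal sketch that integrates the first fundamental form along a polygonal image contour and compares the result with the length of the back-projected world polygon. It treats a single camera only, and the midpoint-rule integrator, the function names, and the numeric values are assumptions made for illustration.

```python
import numpy as np

def fundamental_form(x, y, p, q, c):
    """Coefficients E, F, G of the first fundamental form at image point (x, y)."""
    D = 1.0 - p * x - q * y
    k = c**2 / D**4
    E = k * ((1 - q * y)**2 + p**2 * y**2 + p**2)
    F = k * ((1 - q * y) * q * x + (1 - p * x) * p * y + p * q)
    G = k * (q**2 * x**2 + (1 - p * x)**2 + q**2)
    return E, F, G

def world_length(image_pts, p, q, c, samples=2000):
    """Length of the world contour from a closed image polygon (midpoint rule)."""
    pts = np.asarray(image_pts, dtype=float)
    t = (np.arange(samples) + 0.5) / samples
    total = 0.0
    for a, b in zip(pts, np.roll(pts, -1, axis=0)):
        mid = a + np.outer(t, b - a)          # sample points along the segment
        dx, dy = (b - a) / samples            # image-plane step per sample
        E, F, G = fundamental_form(mid[:, 0], mid[:, 1], p, q, c)
        total += np.sqrt(E * dx**2 + 2 * F * dx * dy + G * dy**2).sum()
    return total

def back_project(image_pts, p, q, c):
    """Inverse imaging function: image points to points on Z = pX + qY + c."""
    pts = np.asarray(image_pts, dtype=float)
    D = 1.0 - p * pts[:, 0] - q * pts[:, 1]
    return np.column_stack((c * pts[:, 0] / D, c * pts[:, 1] / D, c / D))

p, q, c = 0.3, -0.2, 5.0
image_contour = [(-0.2, -0.1), (0.25, -0.15), (0.3, 0.2), (-0.1, 0.25)]

length_from_form = world_length(image_contour, p, q, c)
verts = back_project(image_contour, p, q, c)
length_direct = np.linalg.norm(np.roll(verts, -1, axis=0) - verts, axis=1).sum()
print(length_from_form, length_direct)
```

Because a straight image segment back-projects to a straight segment on the plane, the direct polygon length serves as a reference value for the integral.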

We  now  proceed  to  study  the  same  problem,  but 
using  an  active  observer.  We  distinguish  between  a 
monocular  and  a  binocular  observer. 


3.3.  The  Active  Approach 


3.3.1.  Monocular  Observer 

This problem has already been addressed by
Aloimonos (1986) and Kanatani (1986). In Kanatani’s
scheme, the method developed is an application of the
linear feature theory introduced by Amari (1978). The
treatment is for differential motion. In Kanatani’s
scheme, if we have a closed curve $C$ on the image plane,
then features are defined as various line integrals along $C$
of the form

\[
f = \oint_C F(x,y)\,ds ,
\]

with $ds = \sqrt{dx^2 + dy^2}$, and $F$ any differentiable function.
Then, the change $df/dt$ as the observer moves is connected
through linear equations to the gradient $(p,q)$ of the
plane on which the contour lies. In other words, if the
observer moves with a known motion, then from the two
successive images of the contour, the shape $(p,q)$ of the
three-dimensional contour is uniquely computed, if certain
conditions are satisfied. The solution is given
through linear equations. The sensitivity of the method
to noise depends only on the error introduced by the
numerical differentiation, and as reported the
method seems unstable in the presence of noise.

3.3.2.  Binocular  Observer 

Here we examine the active perception of shape from
contour by a binocular observer. Again, let a coordinate
system be fixed with respect to the left camera with the
$Z$ axis pointing along the optical axis. We consider that
the image plane $Im_L$ is perpendicular to the $Z$ axis at the
point $(0,0,1)$. Let the nodal point of the right camera be
at $(d,0,0)$, and let its image plane $Im_R$ be identical to the
previous one. Consider a plane $\Pi$ in the world with
equation $Z = pX + qY + c$ that contains a contour $C$,
and consider the perspective images $C_L$ and $C_R$ of the
contour on the left and right image frames respectively.
Let $S_L$ and $S_R$ be the areas of the left and right image
contours respectively. Then (see Appendix 1)

\[
\frac{S_L}{S_R} = \frac{1 - A_L p - B_L q}{1 - A_R p - B_R q},
\tag{3.2}
\]

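where $(A_L,B_L)$, $(A_R,B_R)$ are the centers of mass of the
left and right contours respectively. We call this constraint
the area-ratio constraint. Now, we rotate both
eyes by a small angle $\theta$ around the $X$ axis (this can also
be simulated). The new image coordinates are

\[
x' = \frac{r_{11}x + r_{21}y + r_{31}}{r_{13}x + r_{23}y + r_{33}},
\qquad
y' = \frac{r_{12}x + r_{22}y + r_{32}}{r_{13}x + r_{23}y + r_{33}},
\]

where

\[
R = (r_{ij})_{3\times 3} =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\theta & -\sin\theta \\
0 & \sin\theta & \cos\theta
\end{pmatrix}.
\]

Hence

\[
x' = \frac{x}{\cos\theta - y\sin\theta},
\qquad
y' = \frac{\sin\theta + y\cos\theta}{\cos\theta - y\sin\theta}.
\]

Then the new gradient $(p',q')$ of the world contour is

\[
p' = \frac{p}{\cos\theta + q\sin\theta}
\tag{3.3}
\]

\[
q' = \frac{q\cos\theta - \sin\theta}{\cos\theta + q\sin\theta}
\tag{3.4}
\]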

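But again from the area-ratio constraint

\[
\frac{S_L^\theta}{S_R^\theta} =
\frac{1 - A_L^\theta p' - B_L^\theta q'}{1 - A_R^\theta p' - B_R^\theta q'},
\tag{3.5}
\]

where $(A_L^\theta,B_L^\theta)$, $(A_R^\theta,B_R^\theta)$ are the centers of mass of the
left and right contours after the rotation and $S_L^\theta$, $S_R^\theta$ are
their areas, respectively.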
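
The gradient transformation (3.3), (3.4) can be checked numerically. The sketch below is illustrative only: it assumes that the rotated camera coordinates are obtained as X' = Rᵀ X (the convention implied by the image-coordinate formula above), and the helper names and numeric values are hypothetical.

```python
import numpy as np

def rot_x(theta):
    """Rotation matrix about the X axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rotated_gradient(p, q, theta):
    """Gradient (p', q') after the rotation, following (3.3) and (3.4)."""
    den = np.cos(theta) + q * np.sin(theta)
    return p / den, (q * np.cos(theta) - np.sin(theta)) / den

def fit_plane(points):
    """Least-squares fit of Z = pX + qY + c; returns (p, q, c)."""
    A = np.column_stack((points[:, 0], points[:, 1], np.ones(len(points))))
    sol, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return sol

p, q, c, theta = 0.3, -0.2, 5.0, 0.1
XY = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
world = np.column_stack((XY, p * XY[:, 0] + q * XY[:, 1] + c))

# Assumed convention: new coordinates X' = R^T X, i.e. row vectors times R.
rotated = world @ rot_x(theta)

p_fit, q_fit, _ = fit_plane(rotated)
p_new, q_new = rotated_gradient(p, q, theta)
print((p_fit, q_fit), (p_new, q_new))
```

Under this convention the plane fitted to the rotated points has the gradient predicted by (3.3) and (3.4).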

Using (3.3), (3.4) in (3.5) we get

\[
\frac{S_L^\theta}{S_R^\theta} =
\frac{\cos\theta + q\sin\theta - A_L^\theta p - B_L^\theta (q\cos\theta - \sin\theta)}
     {\cos\theta + q\sin\theta - A_R^\theta p - B_R^\theta (q\cos\theta - \sin\theta)}
\tag{3.6}
\]

Equations (3.2) and (3.6) constitute a linear system in the
unknowns $p$ and $q$ that has a unique solution in general.
These equations, however, become degenerate when
$p = 0$; this is because when $p = 0$, both equations reduce
to

\[
\frac{S_L}{S_R} = \frac{1 - B_L q}{1 - B_R q}
\]
But the y-coordinates are the same in both images, and
so $B_L = B_R$ and consequently $S_L = S_R$. So, if $p = 0$ the
areas in both images are equal. Appendix 2 develops a
condition that proves that this is sufficient too, i.e. $p = 0$
iff the areas in both images are equal. Here, we devise a
method for computing $q$ in case $p = 0$.

We can easily prove that if both $p$ and $q$ are zero
(world plane parallel to image planes), then the lengths of
the contours in both images (perimeters) are equal. The
converse does not hold, though: equal contour lengths do
not imply that $p = q = 0$; there are
some degenerate cases, which are discussed in (Aloimonos
and Basu, 1986). For the purposes of this section, assuming
that we can check for the degenerate cases, we propose
the following algorithm for actively perceiving shape
from contour in the case of $p = 0$.

Step 1: Rotate the cameras so that the discrepancy between
the lengths of the left and right contours is minimized.

Step  2:  q  corresponds  to  the  rotation  that  minimizes  the 
discrepancy. 

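Let $(0, q, -1)$ be the surface normal of the world contour.
Rotating the camera by $\theta$ around the x-axis, we get

\[
\begin{pmatrix} p' \\ q' \\ k' \end{pmatrix} =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\theta & -\sin\theta \\
0 & \sin\theta & \cos\theta
\end{pmatrix}
\begin{pmatrix} 0 \\ q \\ -1 \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} p' \\ q' \\ k' \end{pmatrix} =
\begin{pmatrix} 0 \\ q\cos\theta + \sin\theta \\ q\sin\theta - \cos\theta \end{pmatrix}.
\]

We need $q' = 0$, hence

\[
q\cos\theta + \sin\theta = 0 \quad \text{or} \quad q = -\tan\theta .
\]

Thus, if $\theta$ is the rotation that minimizes the difference
between contour lengths, then $q = -\tan\theta$ gives the
corresponding $q$. A similar analysis can be done for a
verging stereo system. The mathematics is much more
complicated and it can be found in (Aloimonos and Basu,
1986).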
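
The two steps can be sketched as a one-dimensional search over the rotation angle. In the illustration below the measured contour-length discrepancy is not modeled; the magnitude of the rotated normal's second component is used as a hypothetical stand-in objective, and `estimate_q`, the angle grid, and the value of `q_true` are assumptions.

```python
import numpy as np

def rotate_normal(q, theta):
    """Rotate the surface normal (0, q, -1) by theta about the x-axis."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return R @ np.array([0.0, q, -1.0])

def estimate_q(discrepancy, thetas):
    """Step 1: pick the rotation minimizing the discrepancy. Step 2: q = -tan(theta)."""
    best = thetas[np.argmin([discrepancy(t) for t in thetas])]
    return -np.tan(best), best

q_true = 0.7
thetas = np.linspace(-1.2, 1.2, 24001)

# Stand-in objective (assumption): |q'| vanishes when the plane becomes frontal.
q_est, theta_best = estimate_q(lambda t: abs(rotate_normal(q_true, t)[1]), thetas)
print(q_est, theta_best)
```

In a real system the objective passed to `estimate_q` would be the difference between the measured left and right contour lengths at each camera rotation.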

3.4.  Summary  and  Discussion 


We have presented a theory for the computation of
shape from contour by an active observer. The constraints
involved demonstrate the superiority of an active
observer vs. a passive one, with respect to computation.
Uniqueness and linearity vs. ill-posedness and instability
make the AV paradigm with respect to the shape from
contour problem very appealing and worth studying. The
advantage of binocular shape from contour over monocular
shape from contour lies in the fact that the monocular
case has been demonstrated to be unstable. The problem
with the active binocular shape from contour theory is
that the contours in the left and right images have to be
corresponded. We have not solved this problem but it is
certainly easier to correspond macrofeatures (contours)
rather than microfeatures. Finally, the problem of shape
from nonplanar contour, i.e. understanding the structure
of the contour without correspondences in an active way,
is treated in (Ito and Aloimonos, 1987a). We chose here
the case of a planar contour, because this case has been
addressed extensively by past research.

4.  SHAPE  FROM  TEXTURE 

The  problem  of  shape  from  texture  has  received  a  lot 
of  attention  in  the  past  few  years  and  some  excellent 
research  on  the  topic  has  been  published  (Gibson,  1950; 
Stevens,  1980;  Witkin,  1980;  Davis  et  al.,  1983;  Kanatani, 
1984;  and  Aloimonos,  1986).  The  problem  in  the  passive 
case  is  defined  as  “finding  the  orientation  of  a  textured 
surface  from  a  static  monocular  view  of  it.”  This  problem 
is  ill-posed  in  the  sense  that  there  exist  infinitely  many 
solutions.  To  restrict  the  space  of  solutions,  assumptions 
have  to  be  made  about  the  texture.  Assumptions  such  as 
directional  isotropy  and  uniform  density  have  been 
employed  in  previous  research.  Uniform  density  has  been 
defined  as  density  of  texels  or  density  of  the  sum  of  the 
lengths  of  the  contours  (zero-crossings)  in  the  image. 

It is very clear that even though some of the
assumptions used in the literature for the recovery of
shape from texture are general enough, they are not
powerful enough to capture a very large subset of natural
images. As a result, the developed algorithms fail when
they are applied to many real surfaces. Furthermore,
there is no way to check in advance whether or not a
particular assumption is valid for the surface that is imaged.
This problem alone is enough to demonstrate the restricted
applicability of the existing shape from texture
algorithms (or of the ones yet to come). We will show
that if the observer is active then the shape from texture
problem, or the problem of shape detection from surface
intensity and markings, becomes easy, in the sense that
no restrictive assumptions are necessary and the solution
is obtained from linear equations.

The next section introduces the mathematical prerequisites.
For simplicity, and without loss of generality,
we will assume that the surface in view is planar (as in
previous shape from texture research). If the surface in
view is nonplanar, then the problem can be addressed
either by applying our theory locally in the image, i.e.
assuming that the surface in view is locally planar, or, if a
parametric model for the surface is assumed, by using the
same basic principles reported here to
recover the parameters of the surface.


4.1.  Prerequisites 

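Suppose that the camera is looking at a planar surface.
Assume further that the camera is moving. For our
analysis we assume instead that the surface is moving. This is
equivalent to the motion of the camera, and it is done
here to simplify the formulas. Call the planar
surface in the world $W$ and the image plane $R$. Suppose
that point $\mathbf{X} = (X,Y,Z) \in W$ is projected onto point
$\mathbf{x} = (x,y) \in R$. Let the motion of the surface consist of a
translation $\mathbf{T} = (t_1,t_2,t_3)$ and a rotation
$\boldsymbol{\Omega} = (\omega_1,\omega_2,\omega_3)$,
or $\mathbf{V}(\mathbf{X}) = \mathbf{T} + \boldsymbol{\Omega} \times \mathbf{X}$,
where $\mathbf{V}(\mathbf{X})$ is the velocity of a
point $\mathbf{X} \in W$. Then this velocity can be written as

\[
\mathbf{V}(\mathbf{X}) = \sum_{k=1}^{6} r_k \mathbf{V}_k(\mathbf{X}), \quad \text{where}
\]

\[
\begin{aligned}
r_1 &= t_1, & \mathbf{V}_1(\mathbf{X}) &= (1 \;\; 0 \;\; 0)^T \\
r_2 &= t_2, & \mathbf{V}_2(\mathbf{X}) &= (0 \;\; 1 \;\; 0)^T \\
r_3 &= t_3, & \mathbf{V}_3(\mathbf{X}) &= (0 \;\; 0 \;\; 1)^T \\
r_4 &= \omega_1, & \mathbf{V}_4(\mathbf{X}) &= (0 \;\; {-Z} \;\; Y)^T \\
r_5 &= \omega_2, & \mathbf{V}_5(\mathbf{X}) &= (Z \;\; 0 \;\; {-X})^T \\
r_6 &= \omega_3, & \mathbf{V}_6(\mathbf{X}) &= ({-Y} \;\; X \;\; 0)^T
\end{aligned}
\]

Then, it can be easily proved that the optic flow (image
velocity) at a point $\mathbf{x} = (x,y)$ is

\[
\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k(\mathbf{x}).
\]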

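We prove the above equation avoiding the details of
the perspective projection. Let the projection from the
world to the image plane be $P$, with
$P(\mathbf{X}) = \mathbf{x} = (x,y) = (P_1(\mathbf{X}), P_2(\mathbf{X}))$.
If the shape of the surface $W$ is given (a function $A$), a mapping
$P_A^{-1}: R \to$ object surface is defined such that
$P(P_A^{-1}(\mathbf{x})) = \mathbf{x}$.

The optic flow $\dot{\mathbf{x}}$ at a point $\mathbf{x} = (x,y)$ is then given
by

\[
\dot{\mathbf{x}} = \frac{\partial P}{\partial \mathbf{X}}\left(P_A^{-1}(\mathbf{x})\right)\,
\mathbf{V}\left(P_A^{-1}(\mathbf{x})\right),
\]

where both $\partial P / \partial \mathbf{X}$ and $\mathbf{V}$ are functions of retinal
coordinates. So, since $\mathbf{V} = \sum_{k=1}^{6} r_k \mathbf{V}_k$, we have

\[
\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k(\mathbf{x})
\quad \text{with} \quad
\mathbf{u}_k(\mathbf{x}) = \frac{\partial P}{\partial \mathbf{X}}\left(P_A^{-1}(\mathbf{x})\right)\,
\mathbf{V}_k\left(P_A^{-1}(\mathbf{x})\right).
\tag{4.1}
\]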

Equation  (4.1)  will  be  used  very  frequently  in  the  sequel. 
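
As a hedged illustration of (4.1), the sketch below builds the six basis velocities, maps them through the Jacobian of the perspective projection (focal length 1 is assumed, so P(X) = (X/Z, Y/Z)), and compares the resulting flow with a finite-difference estimate. The function names and numeric values are hypothetical, and the planar back-projection is one possible choice of the mapping from image to surface.

```python
import numpy as np

def back_project(x, y, p, q, c):
    """World point on Z = pX + qY + c whose image (focal length 1) is (x, y)."""
    Z = c / (1.0 - p * x - q * y)
    return np.array([x * Z, y * Z, Z])

def basis_velocities(X):
    """Rows are the six basis velocities V_1..V_6 evaluated at world point X."""
    Xc, Yc, Zc = X
    return np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                     [0, -Zc, Yc], [Zc, 0, -Xc], [-Yc, Xc, 0]], dtype=float)

def projection_jacobian(X):
    """Jacobian of P(X) = (X/Z, Y/Z) with respect to (X, Y, Z)."""
    Xc, Yc, Zc = X
    return np.array([[1 / Zc, 0, -Xc / Zc**2],
                     [0, 1 / Zc, -Yc / Zc**2]])

def flow_basis(X):
    """Rows are the flow basis vectors u_k at the image of X (6 x 2 array)."""
    return basis_velocities(X) @ projection_jacobian(X).T

T = np.array([0.2, -0.1, 0.3])          # translation (t1, t2, t3)
Omega = np.array([0.05, -0.02, 0.04])   # rotation (w1, w2, w3)
r = np.concatenate((T, Omega))

X = back_project(0.1, -0.2, p=0.3, q=-0.2, c=5.0)
velocity = r @ basis_velocities(X)      # sum_k r_k V_k(X)
flow = r @ flow_basis(X)                # sum_k r_k u_k(x)

# Finite-difference check: move the point for a short time and re-project.
dt = 1e-6
moved = X + dt * velocity
flow_fd = (moved[:2] / moved[2] - X[:2] / X[2]) / dt
print(flow, flow_fd)
```

Note that the translational basis flows carry a factor 1/Z, which on the plane equals (1 - px - qy)/c, so they depend linearly on the plane parameters p/c, q/c and 1/c; the rotational basis flows do not depend on the plane at all.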


4.2.  Linear  Features 

Here we introduce the concept of a linear feature
vector that has proved to be a strong device for several
problems in cybernetics (Amari, 1978). Let the image
intensity function be denoted by $s(x,y)$. A linear feature
(LF) is a linear function $f$ over the image, i.e.

\[
f = \iint s(x,y)\, m(x,y)\, dx\, dy ,
\]

where $m$ is called a measuring function, $s$ is the image
brightness and the integration is taken over the area of
interest. A linear feature vector $\mathbf{f}$ (LFV) is a vector of
linear features, i.e.

\[
\mathbf{f} = [\, f_1 \;\; f_2 \;\; \cdots \;\; f_n \,]^T , \quad \text{with} \quad
f_i = \iint s\, m_i\, dx\, dy ,
\]

where $m_i$ is a measuring function, for $i = 1, \ldots, n$.
$\{m_i\}$ could be any set; one good example is
$\{m_i\} = \{m_{pq}\} = \{e^{i(px + qy)}\}$, in which case a linear
feature corresponds to a Fourier component of the image.
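
A discrete version of a linear feature vector can be sketched as follows. This is an illustration under stated assumptions: the image is sampled on a unit-spaced pixel grid, the measuring functions are complex exponentials at the DFT frequencies, and the sign of the exponent is a convention chosen here (NumPy's forward DFT uses the opposite sign, so for a real image the two differ by complex conjugation). The function names are hypothetical.

```python
import numpy as np

def linear_feature(s, m):
    """Discrete approximation of f = integral of s * m over the image (unit pixel area)."""
    return np.sum(s * m)

def fourier_measuring_function(shape, kx, ky):
    """Measuring function exp(i(px + qy)) sampled at DFT frequencies p = 2*pi*kx/w, q = 2*pi*ky/h."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    return np.exp(1j * 2 * np.pi * (kx * x / w + ky * y / h))

rng = np.random.default_rng(0)
s = rng.random((8, 8))                       # synthetic stand-in for image brightness
freqs = [(0, 0), (1, 0), (0, 1), (2, 3)]     # (kx, ky) pairs, chosen arbitrarily

# Linear feature vector: one feature per measuring function.
f = np.array([linear_feature(s, fourier_measuring_function(s.shape, kx, ky))
              for kx, ky in freqs])

# Reference: conjugate of NumPy's forward DFT (valid because s is real).
reference = np.array([np.conj(np.fft.fft2(s)[ky, kx]) for kx, ky in freqs])
print(np.abs(f - reference).max())
```

Any other family of measuring functions can be substituted by replacing `fourier_measuring_function`; the feature remains linear in the image.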

4.3.  The  Constraint 

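Since there is motion, the induced optical flow
satisfies the following equation (approximately):

\[
s_x u + s_y v + s_t = 0,
\]

where $(u,v)$ is the optic flow at a point $(x,y)$ and $s_x$, $s_y$,
$s_t$ are the spatiotemporal derivatives of the image intensity
function at the point $(x,y)$. This equation can be
written as

\[
\frac{\partial s}{\partial t} = -\,\dot{\mathbf{x}} \cdot \nabla s .
\]

The time derivative of an LFV will be

\[
\dot{\mathbf{f}} = [\, \dot f_1 \;\; \dot f_2 \;\; \cdots \;\; \dot f_n \,]^T , \quad \text{where}
\]

\[
\dot f_i = \iint \frac{\partial s}{\partial t}\, m_i\, dx\, dy
         = -\iint m_i\, (\dot{\mathbf{x}} \cdot \nabla s)\, dx\, dy .
\]

The optic flow field (from equation 4.1) can be written in
the form

\[
\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k(\mathbf{x})
                = \sum_{k=1}^{6} r_k \begin{pmatrix} u_k(\mathbf{x}) \\ v_k(\mathbf{x}) \end{pmatrix} .
\]

From this,

\[
\dot f_i = -\iint m_i\, (\dot{\mathbf{x}} \cdot \nabla s)\, dx\, dy \quad \text{or}
\]

\[
\dot f_i = -\sum_{k=1}^{6} r_k \iint m_i\, (u_k s_x + v_k s_y)\, dx\, dy \quad \text{or}
\]

\[
\dot f_i = \sum_{k=1}^{6} r_k h_{ik}, \quad \text{with} \quad
h_{ik} = -\iint m_i\, (u_k s_x + v_k s_y)\, dx\, dy .
\]

So, we have found that

\[
\dot{\mathbf{f}} = H \mathbf{r},
\tag{4.2}
\]

where $H = (h_{ik})$ and $\mathbf{r} = (r_1 \; r_2 \; \cdots \; r_6)^T$, the motion
parameters.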

Matrix $H$ contains the parameters of the plane in a
linear form. So, equation (4.2) relates linear features with
shape and motion parameters. Furthermore, it is linear
in the shape of the planar surface in view. So, a simple
linear least-squares method or a Hough transform technique
is sufficient for the recovery of the gradient of the
plane in view. Depth can be computed too, if desired.

Finally, we want to stress here the fact that in this
algorithm, the spatial derivatives of the intensity function
don’t need to be computed. This is due to the linear
feature vector approach. (Integration by parts avoids
differentiation of the intensity function. Instead, the
derivative of the measuring function has to be computed.
So, we avoid differentiating the image intensity, which is
discrete, because numerical differentiation is an ill-posed
problem.) More importantly, the same approach can be
followed if the image is a dot pattern (or a line pattern,
e.g. zero crossings), i.e. it is discontinuous. The reason for
this is again the fact that the spatial derivatives of the
intensity function don’t have to be estimated. Only the
temporal derivative of the image needs to be estimated. This
approach has been initiated in (Ito and Aloimonos, 1987).
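
The integration-by-parts step can be illustrated numerically. The sketch below is a check under assumptions, not the authors' implementation: the image is a smooth Gaussian blob that vanishes at the border (so boundary terms drop out), the measuring function is the imaginary part of a Fourier measuring function, and (u, v) = (1 + x², xy) is the rotational flow basis field for rotation about the Y axis. It compares the integral of m(u s_x + v s_y), which needs image derivatives, with minus the integral of s times the divergence of (m u, m v), which needs only derivatives of known functions.

```python
import numpy as np

h = 0.02
axis = np.arange(-8.0, 8.0 + h / 2, h)
x, y = np.meshgrid(axis, axis)

# Assumed smooth test image and its analytic spatial derivatives.
s = np.exp(-(x**2 + y**2) / 2)
s_x, s_y = -x * s, -y * s

# Measuring function m = sin(ax + by) and its derivatives.
a, b = 0.7, -0.4
phi = a * x + b * y
m = np.sin(phi)
m_x, m_y = a * np.cos(phi), b * np.cos(phi)

# Flow basis field (u, v) = (1 + x^2, xy) and the derivatives needed for the divergence.
u, v = 1 + x**2, x * y
u_x, v_y = 2 * x, x

# Version that differentiates the image.
with_image_derivs = -np.sum(m * (u * s_x + v * s_y)) * h * h

# Version after integration by parts: only m, u, v are differentiated.
divergence = m_x * u + m * u_x + m_y * v + m * v_y
with_measuring_derivs = np.sum(s * divergence) * h * h

print(with_image_derivs, with_measuring_derivs)
```

The second form never touches a derivative of the image, which is why the same computation remains meaningful for dot or line patterns.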

4.4.  Summary  and  Discussion 

We have presented a method for the recovery of the
shape of a planar surface by an active observer. Our
method does not rely on any assumptions about the texture
and it does not require the image to be spatially
differentiable. The approach is based on the fact that the
observer is moving with a known motion. If the observer
is moving with an unknown motion, then again the problem
is solvable, and it has been addressed by Aloimonos
and Bandyopadhyay (1986), Negahdaripour (1986), and
Amari et al. (1985). We will report elsewhere our
research on this case.

5.  STRUCTURE  FROM  MOTION 
5.1.  Introduction 

The problem of structure from motion has received
considerable attention lately (Ullman, 1979; Longuet-Higgins
and Prazdny, 1980; Tsai and Huang, 1984). The
problem is to recover the three-dimensional motion and
structure of a moving object from a sequence of its
images. Even though computation of structure and 3-D
motion are equivalent when the retinal motion is given,
the two problems have received different names, the
former “Structure from Motion” and the latter “Passive
Navigation.” We will refer to them interchangeably.

Basically there have been two approaches toward
solving this problem. The first assumes “small” motion.
In this case, if the three-dimensional intensity function
(two spatial arguments and one temporal argument) is locally
well-behaved and its spatiotemporal gradients are defined,
then the image velocity field (or optic flow) may be computed
(Horn and Schunck, 1981; Hildreth, 1984). Algorithms
developed using this approach use the velocity
field to compute 3-D motion and structure. The second
approach assumes that the motion is large and measurement
of image motion entails solving the correspondence
problem. Imaged feature points due to the same
three-dimensional artifact (e.g. texture element or edge
junction) in two successive dynamic frames are assumed to be
identified correctly. Algorithms using this approach compute
3-D transformation parameters from the above-mentioned
displacement field (Tsai and Huang, 1984; Roach
and Aggarwal, 1980; Nagel and Neumann, 1981). There
is a third approach that computes 3-D motion directly
from brightness patterns, but the general case (unrestricted
motion) has not been solved yet (Aloimonos and
Brown, 1984; Negahdaripour and Horn, 1985).

The “small” (continuous) and “large” (discrete)
motion cases are slightly different in terms of the constraints
that relate the 3-D to the 2-D motion, although
the results are essentially the same (Fang and Huang,
1984). So, we concentrate here only on the continuous
case, where the input to the “structure from motion”
perceptual process is the optic flow field. What will be
developed in the sequel has meaning if and only if the
optic flow (image velocity) is the projection of the
three-dimensional motion. We state this explicitly, because the
velocity of the brightness patterns in the image is computed
(according to the existing literature) using some
assumptions, which might violate the fact that image
velocity is the projection of 3-D motion. In the case
where optic flow is used as input, there is some excellent
research (Bruss and Horn, 1983; Prazdny, 1981). The
basic problems of this passive approach are:

1)  The constraint that relates 3-D to 2-D motion is
nonlinear.

2)  The dimensionality of the space of unknowns is high
(five, if one camera is used).

Sometimes, closed form solutions may be found
(Longuet-Higgins and Prazdny, 1980), but higher order
derivatives of the optic flow are involved. Thus, given
that optic flow will be noisy (there is no algorithm to
date that can compute optic flow in natural scenes with
high accuracy) and also given that numerical
differentiation is an ill-posed problem (Poggio et al.,
1986), the efficacy of these approaches is questionable. In
this section we prove that an active observer solves the
structure from motion problem more efficiently. In particular,
an active monocular observer brings down the
dimension of the space of the unknowns from five to four,
but does not get rid of the nonlinearity. An active
binocular observer greatly simplifies the constraints used
in the analysis of motion and permits simple closed form
solutions of the resulting parameter equations.

The strategy advocated in this section calls for visual
tracking and here the formulation is the one in
(Bandyopadhyay, 1986). This approach is suggested as a possible
way to overcome the difficulties of the passive monocular
approach. The possible employment of active tracking to
facilitate navigation has been suggested by visual
psychologists (Cutting, 1982).

It will be shown (Bandyopadhyay, 1986) that when
the observer is able to track a prominent feature point in
the imaged scene, the task of navigation is facilitated
since it is easier to compute egomotion parameters,
compared to the non-tracking case. The emphasis in this
section is on the mathematics governing the imaging
equations that are obtained while the system is tracking. To
track, the system must have some way of measuring the
error in the retinal signal.

The  outline  of  this  section  is  as  follows: 

1.  Error  velocity  measurement  to  correct  tracking  drift 
is  discussed  in  light  of  the  primate  pursuit  system. 

2. A general form of the relation between 3D velocity
parameters and retinal optical flow is derived. In
previous derivations of this relation, the origins of
the body centered coordinate frame and viewer centered
coordinate frame are taken to coincide at the
instant of measurement. Using the general representation
it is shown why a monocular observer, who is
able to track an environmental feature point, has to
contend with a smaller number of velocity parameters.

3. A new set of constraint equations is derived for the
tracking observer, which allows closed form solution
of the egomotion parameters. Simulation results are
described and implementational issues for integrating
this module into the overall motion interpretation
scheme are discussed.

5.2.  Target  Selection  Via  Velocity  Channels 

The  key  assumption  is  that  the  alignment  of  the 
camera  axes  is  controllable  by  the  system  itself.  In  this 
case,  as  the  system  moves  in  the  world,  the  orientation  of 
the  camera  is  continually  adjusted.  This  adjustment  is 
dependent  upon  the  two  dimensional  motion  perceived  on 
the  retina. 

In the tracking system the problem can be seen as:
given the image of a target environmental point, to generate
control signals that will foveate the target. The
block diagram of a system for accomplishing this can be
schematized as shown in Figure 5.1. The first and most
important point to make is that the system can be adequately
modeled by servomechanism concepts. It is relatively
easy to see how to generate the kinds of motor
commands for the two movement system to produce the
observed behavior. This of course assumes that the target
point is identified.

Target identification is a central issue: in a complicated
motion field, how can the target velocities be easily
identified? This is a basic subproblem in tracking using
velocity sensing and is captured by Figure 5.2.

Our answer to this question uses the notion of global
flow field vectors. Such vectors respond to velocities in
every part of the optical flow field. In other words, if we
visualize the optic flow field as a four dimensional parameter
space $(x, y, u(x, y), v(x, y))$, the global flow field
sums all the different flow vectors in a two dimensional
$(u, v)$ parameter space. The detectors' sensitivities are
organized into channels. In the case of a particular flow
field, some channels will typically respond to it and others
will not. Figure 5.3 shows how the channel concept
can be utilized.

We claim that with this abstract flow channel model
the problem becomes one of determining which of the
channels should be used for the eye movement control
system. This means that a mechanism is needed to
switch the appropriate channels into the servo system.
Note that this technique uses a spatially distributed
detector array. Our contention is that it is appropriate
to average the flow field over this subset.

A mechanism to switch the detectors on once the
appropriate ones have been identified is simple to understand,
so we will concentrate on identifying the ideas
behind selecting the right detectors. The general way
that this is done is by a feed-forward mechanism that
determines some selection criterion. The different kinds
of criteria are important, so it is useful to categorize
them.

1.  Extrinsic  features.  This  method  uses  some  other 
feature,  say  color,  that  also  has  spatially  organized 
detectors.  To  track  a  red  object,  the  detectors  that 
register  red  are  used  to  select  the  spatial  component 
of  the  velocity  detectors.  All  such  detectors  with  the 
appropriate  correspondence  are  used. 

2. Intrinsic features. This method uses some particular
range of values for the flow field itself, say all values
over a certain velocity magnitude. To track an
object, all the detectors that satisfy the intrinsic criterion
are switched into the movement control system.

These distinctions are important as they correspond
to two different types of tracking situations. In navigation,
where the entire spatial field is moving, an extrinsic
feature is appropriate. In pursuing a small target, that
target is usually moving differently with respect to the
background, so an intrinsic feature may be appropriate.
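The channel idea and the two selection criteria can be illustrated with a small sketch. It is only an illustration under stated assumptions: the flow field is a dictionary from pixel positions to (u, v) vectors, channels are unit-width bins in (u, v) space, and all function names (`global_flow_histogram`, `select_intrinsic`, `select_extrinsic`, `servo_error`) are hypothetical, not part of the system described here.

```python
from collections import defaultdict

def channel_of(u, v, width=1.0):
    """Quantize a flow vector (u, v) into a velocity-channel index (assumed binning)."""
    return (int(u // width), int(v // width))

def global_flow_histogram(flow, width=1.0):
    """Pool detector responses over all image positions into (u, v) channels."""
    hist = defaultdict(int)
    for (u, v) in flow.values():
        hist[channel_of(u, v, width)] += 1
    return dict(hist)

def select_intrinsic(flow, min_speed):
    """Intrinsic criterion: keep detectors whose flow magnitude exceeds a threshold."""
    return {p for p, (u, v) in flow.items() if (u * u + v * v) ** 0.5 > min_speed}

def select_extrinsic(flow, feature_map, wanted):
    """Extrinsic criterion: keep detectors where another feature map (e.g. color) matches."""
    return {p for p in flow if feature_map.get(p) == wanted}

def servo_error(flow, selected):
    """Average the flow over the selected detectors; this is the error velocity."""
    if not selected:
        return (0.0, 0.0)
    n = len(selected)
    return (sum(flow[p][0] for p in selected) / n,
            sum(flow[p][1] for p in selected) / n)

# Toy scene: the background drifts slowly, a red 2x2 target moves faster.
flow = {(x, y): (0.5, 0.0) for x in range(6) for y in range(6)}
color = {p: "gray" for p in flow}
for p in [(2, 2), (2, 3), (3, 2), (3, 3)]:
    flow[p] = (3.0, 1.0)
    color[p] = "red"

histogram = global_flow_histogram(flow)
intrinsic = select_intrinsic(flow, min_speed=2.0)
extrinsic = select_extrinsic(flow, color, "red")
error_velocity = servo_error(flow, intrinsic)
```

In this toy scene both criteria pick the same four detectors, and the averaged flow over them is the signal that would be switched into the servo loop.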

5.3.  Measuring  Egomotion 

5.3.1.  Background 

Consider first the monocular imaging situation where
a sensor is moving relative to a static scene. The coordinate
frame (X,Y,Z) is fixed to the sensor (see Figure
5.4). The viewing direction is along the positive Z-axis.

The  analysis  presented  here  assumes  a  rotating  and 
translating  observer  moving  in  a  static  environment. 
However,  since  the  velocity  parameters  characterize 
motion  relative  to  the  observer’s  frame  of  reference,  the 
analysis  per  se  is  not  affected  by  multiple  moving  objects. 
The  analysis  assumes  the  velocity  representation  for  the 
motion  parameters. 

The reference coordinate frame is fixed to the
observer. There is another coordinate frame fixed at the
point S on the body (see Figure 5.4). The point S has
the velocity $\mathbf{T}_S = (U_S, V_S, W_S)$. At the time of observation
the reference and the body frame axes are parallel to
each other. The rotational velocity of the body is given
by the vector $\Omega = (\alpha, \beta, \gamma)$. The 3D velocity of a point
$P = (X, Y, Z)$ on the body is given by the equation

$$\dot{\mathbf{X}} = \mathbf{T}_S + [\Omega](\mathbf{X} - \mathbf{X}_S) \tag{5.1}$$

where $\mathbf{X}_S = (X_S, Y_S, Z_S)$ denotes the position of the body
origin S, and $\dot{\mathbf{X}}$ denotes the 3D velocity of P (the 'dot'
operator is used throughout to signify differentiation with
respect to time); also

$$[\Omega] = \begin{bmatrix} 0 & -\gamma & \beta \\ \gamma & 0 & -\alpha \\ -\beta & \alpha & 0 \end{bmatrix}$$

Image formation is modeled by perspective projection.
The projection of a point $P = (X, Y, Z)$ is denoted by
$p = (x, y)$. The projective relation is

$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z} \tag{5.2}$$

The constant $f$ is the focal length of the imaging system.
It is the distance separating the nodal point of the camera
(or eye) and the image plane, measured along the optical
axis (i.e. Z axis). In subsequent steps the constant $f$
is assumed to be unity. The velocity of image points in
the 2D image space is called optical flow. The relations
between the 2D and 3D velocities are obtained by
differentiating the equation (5.2) and substituting from
equation (5.1):


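$$u = \frac{U_S - xW_S}{Z} + \frac{\alpha x Y_S - \beta(Z_S + xX_S) + \gamma Y_S}{Z} - \alpha xy + \beta(1 + x^2) - \gamma y \tag{5.3.1}$$

$$v = \frac{V_S - yW_S}{Z} + \frac{\alpha(Z_S + yY_S) - \beta yX_S - \gamma X_S}{Z} - \alpha(1 + y^2) + \beta xy + \gamma x \tag{5.3.2}$$

When the origin of the body coordinate frame coincides
with the reference or observer coordinate frame then $X_S
= Y_S = Z_S = 0$, and $\mathbf{T} = \mathbf{T}_0 = (U, V, W)$, which
simplifies the equation for optical flow to give

$$u = \frac{U - xW}{Z} - \alpha xy + \beta(x^2 + 1) - \gamma y \tag{5.4.1}$$

$$v = \frac{V - yW}{Z} - \alpha(1 + y^2) + \beta xy + \gamma x \tag{5.4.2}$$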


The above pair of equations embodies the constraint that
the optical flow $(u, v)$ imposes upon the parameters of
rigid motion. Thus all an observer has to do to determine
where he is going is to measure the retinal velocity
pattern and then use the above pair of equations applied
at at least five points to determine the 3D velocity of
egomotion. Note that there are six velocity components
(i.e. three for translation and three for rotation). However,
all six parameters cannot be computed from monocular
visual data. This is because of the depth term Z that
occurs in the above pair of equations. The depth introduces
a scaling effect, whereby other things being equal,
multiplying the translational components and the depth
by the same constant factor leaves the perceived retinal
motion unchanged. Thus for example an object at a certain
distance translating with a certain speed generates
the same optical flow field when it is twice as far away
and travelling in the same direction with twice the speed.
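The flow equations for coincident origins and the scaling effect can be checked numerically. The sketch below assumes unit focal length and the rigid-motion convention in which the 3D velocity of a point is T + Ω × X. It compares the closed-form flow, u = (U − xW)/Z − αxy + β(x² + 1) − γy and v = (V − yW)/Z − α(1 + y²) + βxy + γx, against a finite-difference projection of the moving point, and then doubles both the depth and the translation. The function names and the sample values are illustrative assumptions.

```python
def velocity_3d(P, T, Omega):
    """3D velocity of point P relative to the sensor: T + Omega x P."""
    X, Y, Z = P
    U, V, W = T
    a, b, g = Omega
    return (U + b * Z - g * Y, V + g * X - a * Z, W + a * Y - b * X)

def project(P):
    """Perspective projection with unit focal length."""
    X, Y, Z = P
    return (X / Z, Y / Z)

def optical_flow(x, y, Z, T, Omega):
    """Closed-form image velocity at (x, y) for a point at depth Z."""
    U, V, W = T
    a, b, g = Omega
    u = (U - x * W) / Z - a * x * y + b * (x * x + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y * y) + b * x * y + g * x
    return (u, v)

P = (1.0, -2.0, 8.0)
T = (0.3, -0.1, 0.5)
Omega = (0.02, -0.01, 0.03)

x, y = project(P)
u, v = optical_flow(x, y, P[2], T, Omega)

# Finite-difference check: move the point for a short time h and re-project.
h = 1e-6
vel = velocity_3d(P, T, Omega)
P_next = tuple(p + h * d for p, d in zip(P, vel))
x1, y1 = project(P_next)
u_fd, v_fd = (x1 - x) / h, (y1 - y) / h

# Scale ambiguity: twice the depth with twice the translation, same image point.
T2 = tuple(2 * t for t in T)
u2, v2 = optical_flow(x, y, 2 * P[2], T2, Omega)
```

The pair `(u2, v2)` coincides with `(u, v)`, which is the reason translation can only be recovered up to a scale factor from monocular data.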

The monocular observer, lacking depth information,
must eliminate the depth factor from the optical flow
constraints. This will then imply that the observer's
translation can only be determined up to a scale factor.
Thus the number of egomotion parameters of interest is
five, pertaining to the direction of translation and the
rotation.

When the depth variable is eliminated from the above
equations we have

$$\frac{x_0 - x}{y_0 - y} = \frac{u + \alpha xy - \beta(x^2 + 1) + \gamma y}{v + \alpha(y^2 + 1) - \beta xy - \gamma x} \tag{5.5}$$

where $(x_0, y_0) = (U/W, V/W)$ represents the direction of
translation of the observer's coordinate frame.

The above constraint equation demonstrates the
difficulty of motion computation for a monocular
observer. It is nonlinear as well as of high dimensionality;
these two properties in conjunction make the problem
difficult (Tsai and Huang, 1984).
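A small numerical sketch may make the depth-free constraint concrete. It assumes unit focal length and writes the constraint in cross-multiplied form, (x0 − x)(v + α(y² + 1) − βxy − γx) − (y0 − y)(u + αxy − β(x² + 1) + γy) = 0 with (x0, y0) = (U/W, V/W), so that no division is needed. The residual vanishes at any depth for the true parameters and does not vanish for a wrong rotation. The helper names and sample values are illustrative assumptions.

```python
def optical_flow(x, y, Z, T, Omega):
    """Image velocity for coincident body and observer origins, unit focal length."""
    U, V, W = T
    a, b, g = Omega
    u = (U - x * W) / Z - a * x * y + b * (x * x + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y * y) + b * x * y + g * x
    return (u, v)

def depth_free_residual(x, y, u, v, x0, y0, Omega):
    """Cross-multiplied form of the depth-free constraint; zero when it is satisfied."""
    a, b, g = Omega
    num = u + a * x * y - b * (x * x + 1) + g * y   # equals W (x0 - x) / Z
    den = v + a * (y * y + 1) - b * x * y - g * x   # equals W (y0 - y) / Z
    return (x0 - x) * den - (y0 - y) * num

T = (0.2, -0.1, 1.0)
Omega = (0.01, -0.02, 0.03)
x0, y0 = T[0] / T[2], T[1] / T[2]   # direction of translation
x, y = 0.3, 0.4

# The residual is zero whatever the (unknown) depth of the point.
residuals = []
for Z in (5.0, 50.0):
    u, v = optical_flow(x, y, Z, T, Omega)
    residuals.append(depth_free_residual(x, y, u, v, x0, y0, Omega))

# A wrong rotation hypothesis violates the constraint.
u, v = optical_flow(x, y, 5.0, T, Omega)
wrong = depth_free_residual(x, y, u, v, x0, y0, (0.05, -0.02, 0.03))
```

The residual is linear in neither the translation direction nor the rotation, since products such as x0·α appear; this is the nonlinearity referred to above.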

5.3.2.  The  Tracking  Advantage 

It will now be shown that if the monocular
observer can discern a distinguishing feature or mark on
the observed surface then the perception problem
becomes simpler. Suppose that the surface in view has
an easily distinguishable and localized feature at point S
whose corresponding image location is $(x_S, y_S)$. In this
case we can shift the body origin to the point S and
rewrite the optical flow equations as in (5.3). In addition

$$u(x_S, y_S) = u_S = \frac{U_S - x_S W_S}{Z_S}, \qquad v(x_S, y_S) = v_S = \frac{V_S - y_S W_S}{Z_S} \tag{5.6}$$

Combining equations (5.6) and (5.3) one obtains

$$u = \frac{u_S Z_S + (x_S - x)W_S}{Z} + \frac{Z_S}{Z}\left[\alpha x y_S - \beta(1 + x x_S) + \gamma y_S\right] - \alpha xy + \beta(1 + x^2) - \gamma y \tag{5.7.1}$$

$$v = \frac{v_S Z_S + (y_S - y)W_S}{Z} + \frac{Z_S}{Z}\left[\alpha(y y_S + 1) - \beta x_S y - \gamma x_S\right] - \alpha(1 + y^2) + \beta xy + \gamma x \tag{5.7.2}$$


Note that the translational parameters with respect to
the observer's frame (i.e. the observer's actual translation)
are related to the body centered translational
parameters by

$$\begin{aligned} U' &= U_S' - \beta + \gamma y_S \\ V' &= V_S' + \alpha - \gamma x_S \\ W' &= W_S' - \alpha y_S + \beta x_S \end{aligned} \tag{5.8}$$

The above analysis illustrates the fact that given the ability
to estimate the projected velocity of a localized
feature accurately, the constraint equations reduce in
dimensionality by one.
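The change of body origin can be checked with a short sketch. It assumes unit focal length and the rigid-motion convention in which the velocity of a point is T_S + Ω × (X − X_S). The flow written for a body origin at S is compared with the flow written for coincident origins, after converting the translation with the unscaled form of relation (5.8), T_0 = T_S − Ω × X_S. The function names and sample values are illustrative assumptions.

```python
def flow_centered(x, y, Z, T, Omega):
    """Flow when the body origin coincides with the observer origin."""
    U, V, W = T
    a, b, g = Omega
    u = (U - x * W) / Z - a * x * y + b * (x * x + 1) - g * y
    v = (V - y * W) / Z - a * (1 + y * y) + b * x * y + g * x
    return (u, v)

def flow_body_origin(x, y, Z, T_S, Omega, S):
    """Flow when the body origin is at S = (X_S, Y_S, Z_S) with velocity T_S."""
    U_S, V_S, W_S = T_S
    a, b, g = Omega
    X_S, Y_S, Z_S = S
    u = ((U_S - x * W_S) + a * x * Y_S - b * (Z_S + x * X_S) + g * Y_S) / Z \
        - a * x * y + b * (1 + x * x) - g * y
    v = ((V_S - y * W_S) + a * (Z_S + y * Y_S) - b * y * X_S - g * X_S) / Z \
        - a * (1 + y * y) + b * x * y + g * x
    return (u, v)

def observer_translation(T_S, Omega, S):
    """Unscaled form of (5.8): T_0 = T_S - Omega x X_S."""
    U_S, V_S, W_S = T_S
    a, b, g = Omega
    X_S, Y_S, Z_S = S
    return (U_S - b * Z_S + g * Y_S,
            V_S + a * Z_S - g * X_S,
            W_S - a * Y_S + b * X_S)

S = (1.0, 0.5, 4.0)
T_S = (0.1, 0.2, -0.3)
Omega = (0.05, -0.02, 0.01)
x, y, Z = 0.2, -0.1, 6.0

general = flow_body_origin(x, y, Z, T_S, Omega, S)
centered = flow_centered(x, y, Z, observer_translation(T_S, Omega, S), Omega)

# At the feature itself the rotational terms cancel, leaving only translation.
x_S, y_S = S[0] / S[2], S[1] / S[2]
at_feature = flow_body_origin(x_S, y_S, S[2], T_S, Omega, S)
expected_at_feature = ((T_S[0] - x_S * T_S[2]) / S[2], (T_S[1] - y_S * T_S[2]) / S[2])
```

Because the flow at the feature depends only on T_S, measuring it fixes two combinations of the unknowns, which is where the reduction in dimensionality comes from.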

A similar result may be obtained, as can be
expected, when the moving observer is able to track a
single feature point so that it appears stationary on the
retina at position (0,0). In this case we assume that the
tracking motion consists of rotations about the axes that
are orthogonal to the line of sight or the optical axis of
the lens. The tracking motion is a rotation $(\omega_x, \omega_y, 0)$,
which is superimposed upon the actual parameters of
motion.

Let $S = (0, 0, Z_0)$ be the spatial coordinates of the
point being tracked. Assume that the observer can track
an environmental point and hold it steady on the optical
axis (Z axis). Therefore the optical flow field will have a
singularity at the origin of the retinal frame, where the
flow value is zero. At the time of observation, the
tracked point tends to move along the observer's optical
axis (Figure 5.5).

Consider an observer moving with translation
$(U, V, W)$ and rotation $(\alpha, \beta, \gamma)$. Then, if the body frame
origin is taken to be at S, from equation (5.8), remembering
that $U_S = V_S = 0$:

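$$U' = \frac{U}{Z_0} = -B, \qquad V' = \frac{V}{Z_0} = A, \qquad W_S' = W' \tag{5.9}$$

Furthermore, the optical flow equation (5.3) becomes

$$u = -\frac{Z_0}{Z}\,xW' - Axy + B\left(1 - \frac{Z_0}{Z} + x^2\right) - \gamma y$$

$$v = -\frac{Z_0}{Z}\,yW' - A\left(1 - \frac{Z_0}{Z} + y^2\right) + Bxy + \gamma x$$

where $A = \alpha + \omega_x$ and $B = \beta + \omega_y$, and where the prime
signifies scaling by $Z_S$, i.e. $W_S' = W_S / Z_S$.

Eliminating Z from the above we have

$$\frac{u + Axy - B(x^2 + 1) + \gamma y}{v + A(1 + y^2) - Bxy - \gamma x} = -\frac{B + xW'}{A - yW'}$$

where $W' = W / Z_0$.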

The constraint equations derived above are similar in
form to equation (5.5). However, in this case the dimensionality
of the parameter space has been reduced from
five to four without increase in the degree of nonlinearity
of the constraint. It is important to note that the
observer can determine his direction of translation since
from equation (5.9) we have

$$U' = \frac{U}{Z_0} = -B, \qquad V' = \frac{V}{Z_0} = A$$

Thus even without explicitly measuring his tracking
motion, the observer can determine the scaled translation
$(U', V', W')$.
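The tracking case can be sketched numerically as follows. The flow is written in terms of the four unknowns W' = W/Z0, A, B and γ plus the depth ratio Z0/Z of each scene point, as u = −(Z0/Z)xW' − Axy + B(1 − Z0/Z + x²) − γy and v = −(Z0/Z)yW' − A(1 − Z0/Z + y²) + Bxy + γx. The sketch checks that the flow vanishes at the tracked point and that the depth-free constraint, in cross-multiplied form, holds for any depth ratio. The function names and sample values are illustrative assumptions.

```python
def tracking_flow(x, y, depth_ratio, W_scaled, A, B, gamma):
    """Flow seen by a tracking observer; depth_ratio is Z0/Z for the scene point."""
    r = depth_ratio
    u = -r * x * W_scaled - A * x * y + B * (1 - r + x * x) - gamma * y
    v = -r * y * W_scaled - A * (1 - r + y * y) + B * x * y + gamma * x
    return (u, v)

def tracking_residual(x, y, u, v, W_scaled, A, B, gamma):
    """Cross-multiplied depth-free constraint; zero when the parameters are right."""
    num = u + A * x * y - B * (x * x + 1) + gamma * y   # equals -(Z0/Z)(B + x W')
    den = v + A * (1 + y * y) - B * x * y - gamma * x   # equals  (Z0/Z)(A - y W')
    return num * (A - y * W_scaled) + den * (B + x * W_scaled)

W_scaled, A, B, gamma = 0.4, 0.03, -0.02, 0.01

# The tracked point (image origin, depth ratio 1) is a singular point of the flow.
at_origin = tracking_flow(0.0, 0.0, 1.0, W_scaled, A, B, gamma)

# Elsewhere the constraint holds whatever the unknown depth ratio.
residuals = []
for ratio in (0.5, 2.0):
    u, v = tracking_flow(0.3, -0.2, ratio, W_scaled, A, B, gamma)
    residuals.append(tracking_residual(0.3, -0.2, u, v, W_scaled, A, B, gamma))
```

Only four parameters enter `tracking_residual`, against five in the non-tracking constraint, because the two translation components orthogonal to the line of sight are tied to A and B by the tracking condition.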

5.4.  Binocular  Tracking 

The optical axes of the two cameras converge onto a
point in the environment that is being tracked. The
geometry is illustrated in Figure 5.7. It will be shown
that the tracking velocities, which are assumed to be
observable, can be used to compute egomotion parameters.
We will also assume a rectilinearly moving observer,
with negligible instantaneous acceleration rates. The
analysis generally deals with the left coordinate frame,
with respect to which the various quantities will be written
as in the monocular case. When it is necessary to reference
the quantities with respect to the right frame these
will be written primed (e.g. $x'$). The tracking motion
involves three independent rotational velocities $(\omega, \omega_y, \omega_y')$.
The rotation $\omega$ is about the baseline RL of the imaging
system (Figure 5.7). Hence the tracking motion of the
left frame is given by $(\omega_x = -\omega\sin\theta,\ \omega_y,\ \omega_z = \omega\cos\theta)$. If
$Z_0$ is the depth of the tracked point in the left frame
then, by the sine rule with baseline length $2d$,

$$\frac{2d}{\sin(\theta + \theta')} = \frac{Z_0}{\sin\theta'} = \frac{Z_0'}{\sin\theta}$$

Thus we can write

$$F(t) = Z_0 = \frac{2d\,\sin\theta'}{\sin(\theta + \theta')}$$

Also

$$\dot F = 2d\,\frac{\dot\theta'\cos\theta'\sin(\theta + \theta') - (\dot\theta + \dot\theta')\sin\theta'\cos(\theta + \theta')}{\sin^2(\theta + \theta')}$$

which simplifies to

$$\dot F = 2d\,\frac{\dot\theta'\sin\theta - \dot\theta\sin\theta'\cos(\theta + \theta')}{\sin^2(\theta + \theta')}$$

Differentiating the above relation with respect to time
once more and simplifying leads to

$$\ddot F = 2d\,\frac{\ddot\theta'\sin\theta - \ddot\theta\sin\theta'\cos(\theta + \theta') + \dot\theta(\dot\theta + 2\dot\theta')\sin\theta'\sin(\theta + \theta')}{\sin^2(\theta + \theta')} - 2(\dot\theta + \dot\theta')\cot(\theta + \theta')\,\dot F \tag{5.16}$$
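The depth of the fixation point and its time derivatives can be checked numerically from the vergence angles. The sketch below assumes the triangulation relation F = 2d sin θ' / sin(θ + θ'), with 2d the baseline length and θ, θ' the angles between the baseline and the left and right optical axes. The closed forms for the first and second derivative in the code were obtained by differentiating this relation directly; they are an assumption of the sketch and are validated only against finite differences along an arbitrary smooth angle trajectory.

```python
import math

def depth(th, thp, d):
    """F = 2d sin(theta') / sin(theta + theta')."""
    return 2 * d * math.sin(thp) / math.sin(th + thp)

def depth_rate(th, thp, dth, dthp, d):
    """First time derivative of F in terms of the angles and their rates."""
    s = th + thp
    n = dthp * math.sin(th) - dth * math.sin(thp) * math.cos(s)
    return 2 * d * n / math.sin(s) ** 2

def depth_accel(th, thp, dth, dthp, ddth, ddthp, d):
    """Second time derivative of F, obtained by differentiating depth_rate."""
    s = th + thp
    n_dot = (ddthp * math.sin(th)
             - ddth * math.sin(thp) * math.cos(s)
             + dth * (dth + 2 * dthp) * math.sin(thp) * math.sin(s))
    f_dot = depth_rate(th, thp, dth, dthp, d)
    return (2 * d * n_dot / math.sin(s) ** 2
            - 2 * (dth + dthp) * math.cos(s) / math.sin(s) * f_dot)

# Arbitrary smooth angle trajectories (illustrative values only).
def theta(t):
    return 1.0 + 0.1 * t + 0.02 * t * t

def theta_p(t):
    return 0.9 - 0.05 * t + 0.01 * t * t

d, t, h = 0.1, 0.3, 1e-4
dth, dthp = 0.1 + 0.04 * t, -0.05 + 0.02 * t
ddth, ddthp = 0.04, 0.02

F_dot = depth_rate(theta(t), theta_p(t), dth, dthp, d)
F_ddot = depth_accel(theta(t), theta_p(t), dth, dthp, ddth, ddthp, d)

# Central finite differences of F along the same trajectory.
F_plus = depth(theta(t + h), theta_p(t + h), d)
F_mid = depth(theta(t), theta_p(t), d)
F_minus = depth(theta(t - h), theta_p(t - h), d)
F_dot_fd = (F_plus - F_minus) / (2 * h)
F_ddot_fd = (F_plus - 2 * F_mid + F_minus) / (h * h)
```

Since F and its derivatives depend only on the measured angles and their rates, they can be treated as observables in the equations that follow.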

Let the motion of the observer be described by the translational
velocity $\mathbf{T} = (U, V, W)$ and rotational velocity
$\Omega = (\alpha, \beta, \gamma)$. These parameters are defined with respect
to the point L in the body, which also happens to be the
origin of the left coordinate frame. The tracking motion
of the system consists of three independent rotations with
respect to the observer. These three rotations correspond
to the three motors in Figure 5.1. The angular velocity $\omega$
corresponds to the rotation of the plane PRL about the
axis LR. The other two angular velocities are $\dot\theta$ and $\dot\theta'$,
which affect the left and right coordinate frames respectively.
Let the sense of $\omega$ be positive in the direction
from L to R. Then the tracking angular motion of the
left frame with respect to the observer is given by $\omega_l$ and
that of the right frame is $\omega_r$. Note that we will
express all the motion parameters measured in a frame
with respect to basis vectors defined in that frame.
Therefore

$$\omega_l = (-\omega\sin\theta,\ \dot\theta,\ \omega\cos\theta)$$

$$\omega_r = (-\omega\sin\theta',\ \dot\theta',\ -\omega\cos\theta')$$

If the rigid motion parameters with respect to the right
coordinate frame are given by the translational velocity
$\mathbf{T}'$ and the rotational velocity $\Omega'$ then we have

$$\mathbf{T}' = R_\chi \cdot (\mathbf{T} + \Omega \times \mathbf{p})$$

$$\Omega' = R_\chi \cdot \Omega$$

where $\times$ denotes the vector product and $\cdot$ denotes matrix
multiplication. In addition, the rotation matrix $R_\chi$
expresses the transformation due to the rotation by
$\chi = \pi - (\theta + \theta')$ between the left and right frames and is

$$R_\chi = \begin{bmatrix} \cos\chi & 0 & -\sin\chi \\ 0 & 1 & 0 \\ \sin\chi & 0 & \cos\chi \end{bmatrix}$$

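Now from equation (5.9) we have

$$\begin{aligned} \dot F(t) &= W \\ U + (\beta + \omega_y)F(t) &= 0 \\ V - (\alpha + \omega_x)F(t) &= 0 \end{aligned} \tag{5.17}$$

Observe that the above equations involve five unknown
motion parameters. If we now differentiate these equations
we have

$$\begin{aligned} \ddot F(t) &= \dot W \\ \dot U + \dot\beta F(t) + \beta\dot F(t) + \dot\omega_y F(t) + \omega_y\dot F(t) &= 0 \\ \dot V - \dot\alpha F(t) - \alpha\dot F(t) - \dot\omega_x F(t) - \omega_x\dot F(t) &= 0 \end{aligned} \tag{5.18}$$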




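Although here we consider a rectilinearly moving
observer, the translational velocity $(U, V, W)$ undergoes
change due to the rotation of the frame in which the
observations are made. Thus we obtain

$$\begin{aligned} \dot U &= (\beta + \omega_y)W - (\gamma + \omega_z)V \\ \dot V &= -(\alpha + \omega_x)W + (\gamma + \omega_z)U \\ \dot W &= (\alpha + \omega_x)V - (\beta + \omega_y)U \end{aligned}$$

Similarly the rotational velocity $(\alpha, \beta, \gamma)$ undergoes
change due to the tracking motion, as follows:

$$\begin{aligned} \dot\alpha &= \omega_y\gamma - \omega_z\beta \\ \dot\beta &= -\omega_x\gamma + \omega_z\alpha \\ \dot\gamma &= 0 \end{aligned}$$

Introducing the parameters $A = \alpha + \omega_x$, $B = \beta + \omega_y$ and
$C = \gamma + \omega_z$, substituting for $\dot U$, $\dot V$, $\dot W$, $\dot\alpha$ and $\dot\beta$ from the
above relations, and replacing U and V from (5.17), we
have, from the last two equations in (5.18)

$$C = \frac{2B\dot F(t) + A\omega_z F(t) + \dot\omega_y F(t)}{(A + \omega_x)F(t)}$$

and

$$C = \frac{2A\dot F(t) - B\omega_z F(t) + \dot\omega_x F(t)}{-(B + \omega_y)F(t)}$$

Finally eliminating C from the above pair of equations
and using the remaining equation of (5.18) we obtain the
pair of independent equations

$$A^2 + B^2 = \frac{\ddot F(t)}{F(t)} = \phi_1 \tag{5.19.1}$$

$$\phi_2 A + \phi_3 B + \phi_4 = 0 \tag{5.19.2}$$

where

$$\phi_2 = 2\omega_x F(t)\dot F(t) + \dot\omega_x F^2(t) + \omega_y\omega_z F^2(t)$$

$$\phi_3 = 2\omega_y F(t)\dot F(t) + \dot\omega_y F^2(t) - \omega_x\omega_z F^2(t)$$

$$\phi_4 = (\omega_x\dot\omega_x + \omega_y\dot\omega_y)F^2(t) + 2\ddot F(t)\dot F(t)$$

From equation (5.19) we obtain two sets of solutions for
the motion parameters. Eliminating the parameter B we
have

$$aA^2 + bA + c = 0 \tag{5.20}$$

where $a = \phi_2^2 + \phi_3^2$, $b = 2\phi_2\phi_4$ and $c = \phi_4^2 - \phi_1\phi_3^2$.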

To summarize, the solution method consists of obtaining
the solutions to the pair of equations (5.19.1) and (5.19.2).
Since closed form solutions are obtained at every time
instant and assuming the computation errors to be uniformly
random, we perform smoothing on the time series
of the computed parameters, to eliminate a large portion
of the error.
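The closed-form step can be sketched as a small solver. It assumes the pair of constraints A² + B² = φ1 and φ2·A + φ3·B + φ4 = 0, with φ1 = F''/F and the remaining coefficients built from the depth F, its first two derivatives, and the tracking rotation (ω_x, ω_y, ω_z) with the rates of its first two components. Eliminating B gives the quadratic aA² + bA + c = 0 with a = φ2² + φ3², b = 2φ2φ4 and c = φ4² − φ1φ3². The function names and the synthetic test values are illustrative assumptions; the example builds the coefficients from a known (A, B) rather than from simulated tracking data.

```python
import math

def phis_from_observables(F, F_dot, F_ddot, omega, omega_dot):
    """Coefficients phi1..phi4 from depth observables and tracking rotation."""
    wx, wy, wz = omega
    wx_dot, wy_dot = omega_dot
    phi1 = F_ddot / F
    phi2 = 2 * wx * F * F_dot + wx_dot * F * F + wy * wz * F * F
    phi3 = 2 * wy * F * F_dot + wy_dot * F * F - wx * wz * F * F
    phi4 = (wx * wx_dot + wy * wy_dot) * F * F + 2 * F_ddot * F_dot
    return phi1, phi2, phi3, phi4

def solve_A_B(phi1, phi2, phi3, phi4):
    """Return the (A, B) pairs satisfying both constraints, or [] if none are real."""
    a = phi2 ** 2 + phi3 ** 2
    b = 2 * phi2 * phi4
    c = phi4 ** 2 - phi1 * phi3 ** 2
    disc = b * b - 4 * a * c
    if a == 0 or phi3 == 0 or disc < 0:
        return []   # degenerate configuration; not handled in this sketch
    root = math.sqrt(disc)
    solutions = []
    for sign in (1.0, -1.0):
        A = (-b + sign * root) / (2 * a)
        B = -(phi2 * A + phi4) / phi3   # back-substitute into the linear constraint
        solutions.append((A, B))
    return solutions

# Synthetic check: choose (A, B) and two coefficients, derive the other two.
A_true, B_true = 0.3, -0.2
phi2, phi3 = 1.5, -0.7
phi1 = A_true ** 2 + B_true ** 2
phi4 = -(phi2 * A_true + phi3 * B_true)
candidates = solve_A_B(phi1, phi2, phi3, phi4)
```

The solver returns two candidate pairs, one of which is the pair used to build the coefficients; choosing between them needs extra information, which in this scheme comes from the optical flow field.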

The  important  aspects  of  this  method  of  computa¬ 
tion  of  the  motion  parameters  are  as  follows: 

(a)  The  solution  is  in  closed  form,  requiring  no  iteration 
or  search. 


(b) The constraints are derived from the observed tracking
velocities and rotations. We do not need the
optical flow measurements.

(c) Here the observables are $(\theta, \theta', \dot\theta, \dot\theta', \ddot\theta, \ddot\theta')$. These can
be measured quite accurately by analog measurement
apparatus. This possibility forms a strong
motivation for the tracking approach.

(d)  The  optical  flow  field  in  our  motion  perception 
scheme  is  only  used  to  disambiguate  between  the 
possible  interpretations  computed  by  the  tracking 
module.  This  is  always  possible  since  under 
extended  periods  of  observation  the  optical  flow  field 
generated  is  compatible  with  one  and  only  one 
interpretation  (Bandyopadhyay  1986). 

Simulation experiments based on the above analysis
show the approach to be robust against noise. The results
are illustrated in Figure 5.9, where one of the rotational
parameters is plotted against time. The values plotted
are those obtained after applying a temporal smoothing
filter to the solution of equation (5.20). These parameter
values are suitable for use as an initial estimator in a
cooperative optical flow and structure computation algorithm.
Basically, the optic flow and structure computation
are strongly interdependent. An independent computational
mechanism for motion parameter estimation can
guide the motion correspondence process and enable the
computation of structure. Namely, in our method, motion
parameters are computed without using optical flow, and
then these parameters along with the intensity profile
enable us to uniquely compute optic flow which in an
unrestricted case is essential for the computation of structure.

6.  CONCLUSIONS  AND  FUTURE  RESEARCH 

We have proposed a new paradigm for visual perception
called active vision. The idea of using active vision
to address the structure from motion problem was introduced
by Bandyopadhyay (1986). Here we have
addressed several other problems such as shape from
shading, texture, contour and depth computation. Our
methodology was demonstrated by showing its applicability
in these important areas of low and intermediate level
vision. It was shown that the controlled alteration of
viewing parameters yields stable and robust algorithms
for shape and motion perception.

The basis for the approach lies in being able to work
in a rich stimulus domain with a partially known
parametrization. This knowledge is due to the fact that
the viewing transformation is known. As the viewing
parameters are continuously varied, the observed visual
stimuli undergo local transformations that are measurable
and provide powerful constraints for the computation of
the unknown scene parameters. We are, of course,
interested in the rates of these stimulus changes, which
have traditionally been thought to be difficult to measure.
However, it should be pointed out that in the present
scheme we do not work with a small set of discrete observations,
but with trajectories in the stimulus space,
termed flow lines. These trajectories are smooth, since
the viewing transformations we use are themselves
smooth, and therefore can be computed accurately
enough for our purposes. Thus we do not need to rely on
the smoothness of properties of the observed scene, such
as illumination and depth. The real power of the method
is due to the avoidance of complications usually associated
with multiview approaches to visual perception. For
instance, the problem of correspondence of microfeatures
is not involved in the current approach.

The treatment in this paper has been largely theoretical,
because we want to create a sound framework for
what we believe is a promising methodology for computer
vision. We plan to verify the analysis with experiments
on synthetic and natural images. Research going on
currently at the University of Rochester (Brown, 1986)
and at SUNY Stony Brook (Bandyopadhyay, 1987) aims
toward developing an active vision system that will be
able to track objects in real time. On the other hand, we
plan to continue our theoretical analysis of active visual
computations, before we build a system with possibly special
hardware that will carry out active visual computations.
The initial developments of the theory of active
vision have been outlined. There are many important
issues that need to be explored in this context. For
instance, we have not looked at the question of whether
there is any viewing parameter trajectory that is preferable
to other trajectories in tackling the computational
task. This could be termed exploratory visual computation,
where the knowledge about scene constraints available
at any stage of the visual process determines the
way in which the control parameters will be altered. Also,
a theoretical error analysis of the proposed algorithms has
to be done, taking into account discretization effects and
noise in images. Finally, the problem of learning the
viewing parameter trajectory that is the best (in computational
terms) with respect to a particular problem,
using a neural (connectionist) network, has to be
addressed. Examination of such issues forms an important
future research goal.

REFERENCES 

Aloimonos, J., “Computing intrinsic images”, Ph.D.
thesis, University of Rochester, 1986.

Aloimonos, J. and Bandyopadhyay, A., “Correspondence
is not necessary for motion perception”, submitted.

Aloimonos, J. and Basu, A., “Shape from contour”,
submitted, 1986.

Aloimonos, J. and Brown, C.M., “Direct processing of
curvilinear sensor motion from a sequence of perspective
images”, IEEE Workshop on Computer Vision,
Annapolis, MD, 72-77, 1984.

Amari, S., “Feature spaces which admit and detect
invariant signal transformations”, Proc. ICPR,
Kyoto, Japan, 452-458, 1978.

Amari, S. and Maruyama, S., “Computation of structure
from motion”, personal communication, 1986.

Bandyopadhyay, A., personal communication, 1987.

Bandyopadhyay, A., “A computational study of rigid
motion perception”, Ph.D. thesis, Department of
Computer Science, University of Rochester, 1986.

Brady, J. and Yuille, A., “An extremum principle for
shape from contour”, IEEE Trans. on PAMI, 6,
288-301, 1984.

Brown, C.M., personal communication, 1986.

Bruss, A. and Horn, B.K.P., “Passive navigation”,
Computer Vision, Graphics and Image Processing, 21,
3-20, 1983.

Cutting, J.E., “Motion parallax and visual flow: how to
determine direction of locomotion”, Dept. of
Psychology, Cornell University, 1982.

Davis, L., Janos, L. and Dunn, S., “Efficient recovery of
shape from texture”, TR-1133, Computer Vision
Laboratory, University of Maryland, 1982.

Fang, J.Q. and Huang, T.S., “Solving three dimensional
small rotation motion equations: uniqueness, algorithms,
and numerical results”, Computer Vision,
Graphics and Image Processing, 26, 183-206, 1984.

Gibson, J.J., The Perception of the Visual World,
Houghton Mifflin, Boston, 1950.

Hadamard, J., Lectures on the Cauchy Problem in Linear
Partial Differential Equations, New Haven: Yale
University Press.

Hildreth, E.C., “Computations underlying the measurement
of visual motion”, Artificial Intelligence, 23,
309-354, 1984.

Horn, B.K.P., Robot Vision, McGraw-Hill, 1986.

Horn, B.K.P., “Understanding image intensities”,
Artificial Intelligence, 8, 201-231, 1977.

Horn, B.K.P. and Schunck, B., “Determining optical
flow”, Artificial Intelligence, 17, 185-204, 1981.

Ikeuchi, K. and Horn, B.K.P., “Numerical shape from
shading and occluding boundaries”, Artificial Intelligence,
17, 141-184, 1981.

Ito, E. and Aloimonos, J., “Determining transformation
parameters from images: theory”, to appear in Proc.
IEEE Conference on Robotics and Automation, 1987.

Ito, E. and Aloimonos, J., “Shape from nonplanar contour”,
to appear, 1987a.

Kanade, T., “Recovery of the shape of an object from a
single view”, Artificial Intelligence, 17, 409-460,
1981.

Kanatani, K., “Group theoretical methods in image
understanding”, TR-1692, Computer Vision Laboratory,
University of Maryland, 1986.

Kirkpatrick, S., Gelatt Jr., C.D. and Vecchi, M.P.,
“Optimization by simulated annealing”, RC 9355
(#41093), IBM T.J. Watson Research Center, Yorktown
Heights, NY.

Lee, D. and Pavlidis, T., personal communication, 1986.

Longuet-Higgins, H.C., “A computer algorithm for reconstructing
a scene from two projections”, Nature,
293, 133-135, 1981.

Longuet-Higgins, H.C. and Prazdny, K., “The interpretation
of a moving retinal image”, Proc. Royal Soc.
London B, 208, 385-397, 1980.




Marr, D., Vision, W.H. Freeman, San Francisco, 1982.

Morozov, V.A., Regularization Methods for Solving
Ill-Posed Problems, Springer-Verlag, 1984.

Nagel, H.H. and Neumann, B., “On 3-D reconstruction
from two perspective views”, Proc. IJCAI, 661-663,
1981.

Nalwa, V., “Detecting edges in images”, personal
communication, 1985.

Negahdaripour, S. and Horn, B.K.P., “Determining 3-D
motion of planar objects from image brightness patterns”,
Proc. IJCAI, Los Angeles, CA, 898-901,
1985.

Poggio, T. and Koch, C., “Ill-posed problems in early
vision: from computational theory to analog networks”,
Proc. Royal Soc. London B, 226, 303-333,
1985.

Poggio, T. and the staff, “MIT progress in understanding
images”, Proc. Image Understanding Workshop,
Miami, FL, 25-30, 1985.

Prazdny, K., “Determining the instantaneous direction of
motion from optical flow generated by curvilinearly
moving observer”, Computer Vision, Graphics and
Image Processing, 17, 04-07, 1981.

Roach, J.W. and Aggarwal, J.K., “Determining the movement
of objects from a sequence of images”, IEEE
Transactions on PAMI, 2, 554-562, 1980.

Stevens, K., “The information content in texture gradients”,
Biological Cybernetics, 42, 95-105, 1981.

Terzopoulos, D., “Regularization of inverse problems
involving discontinuities”, IEEE Transactions on
PAMI, 8, 413-425, 1986.

Tichonov, A.N. and Arsenin, V.Y., Solutions of Ill-Posed
Problems, Winston, Washington, 1977.

Tsai, R.Y. and Huang, T.S., “Uniqueness and estimation
of three dimensional motion parameters of rigid
objects with curved surfaces”, IEEE Transactions on
PAMI, 6, 13-27, 1984.

Ullman, S., “The interpretation of structure from
motion”, Proc. Royal Soc. London B, 203, 405-426,
1979.

Waxman, A. and Ullman, S., “Surface structure and 3-D
motion from image flow: a kinematic analysis”,
CAR-TR-24, Center for Automation Research,
University of Maryland, October 1983.

Witkin, A., “Recovering surface orientation and shape
from texture”, Artificial Intelligence, 17, 17-45,
1981.

APPENDIX  1 

Theorem: 

Let a coordinate system OXYZ be fixed with respect
to the left camera, with the Z axis pointing along the
optical axis. We assume that the image plane $Im_1$ is
perpendicular to the Z axis at the point (0,0,1) and that
O is the nodal point of the left camera. Let the nodal
point of the right camera be the point $(R, L, 0)$ and let
its image plane be identical to the previous one, i.e.
$Im_1 = Im_2$. Consider a polygon P on the world plane
$Z = pX + qY + c$, defined by the points $(X_i, Y_i, Z_i)$,
$i = 1, \ldots, n$, and having area $S_W$. Let $S_1$, $S_2$ be the
areas of the paraperspective projections of P on the left
and right cameras respectively and $S_1'$, $S_2'$ the areas of the
perspective projections of the polygon P on the left and
right cameras respectively. Then


Proof: 

The  proof  is  given  in  several  parts. 

Let $(A_1,B_1)$ and $(A_2,B_2)$ be the centers of mass of the
projections of the contour P on the left and right image
planes respectively (it has to be noted that $(A_1,B_1)$ and
$(A_2,B_2)$ are the centers of mass of the actual left and
right images, as opposed to the projections of the center
of mass of P onto the left and right image planes). Then
we have

$$\frac{S_2}{S_1} = \frac{1 - A_2 p - B_2 q}{1 - A_1 p - B_1 q} \qquad (1)$$

The above equation is equation (6.17), which we will
prove to be exact under perspective projection.


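We have

$$A_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{X_i}{Z_i}, \qquad B_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_i}{Z_i}$$

and

$$A_2 = \frac{1}{n}\sum_{i=1}^{n}\frac{X_i - R}{Z_i}, \qquad B_2 = \frac{1}{n}\sum_{i=1}^{n}\frac{Y_i - L}{Z_i}$$

Substituting in (1) we get after some tedious manipulations

$$\frac{S_2}{S_1} = \frac{c + Rp + Lq}{c} \qquad (2)$$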

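On the other hand, we can easily prove that

$$\frac{S_2'}{S_1'} = 1 + R\,\frac{\sum_{i} \dfrac{Y_i - Y_{i+1}}{Z_i Z_{i+1}}}{\sum_{i} \dfrac{X_i Y_{i+1} - Y_i X_{i+1}}{Z_i Z_{i+1}}} + L\,\frac{\sum_{i} \dfrac{X_{i+1} - X_i}{Z_i Z_{i+1}}}{\sum_{i} \dfrac{X_i Y_{i+1} - Y_i X_{i+1}}{Z_i Z_{i+1}}} \qquad (3)$$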

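We can also easily prove that

$$\frac{\sum_{i} \dfrac{Y_i - Y_{i+1}}{Z_i Z_{i+1}}}{\sum_{i} \dfrac{X_i Y_{i+1} - Y_i X_{i+1}}{Z_i Z_{i+1}}} = \frac{p}{c}, \qquad \frac{\sum_{i} \dfrac{X_{i+1} - X_i}{Z_i Z_{i+1}}}{\sum_{i} \dfrac{X_i Y_{i+1} - Y_i X_{i+1}}{Z_i Z_{i+1}}} = \frac{q}{c}$$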

From  equations  (2),  (3),  (4),  (5)  and  (6)  the  proof  of 
the  theorem  is  immediate. 
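
The perspective half of the theorem can be checked numerically. The sketch below is not part of the original proof: it assumes the geometry stated in the theorem (image plane at Z = 1, left nodal point at the origin, right nodal point at (R, L, 0)), lifts an arbitrary polygon onto the plane Z = pX + qY + c, and compares the ratio of the two perspective image areas with 1 + (Rp + Lq)/c. All function names are hypothetical.

```python
def shoelace_area(points):
    """Signed area of a closed 2-D polygon given as a list of (x, y) vertices."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return 0.5 * total


def perspective_projection(world_points, nodal_xy):
    """Project 3-D points onto the plane Z = 1 for a camera whose nodal
    point is (R, L, 0) and whose optical axis is parallel to Z."""
    r, l = nodal_xy
    return [((x - r) / z, (y - l) / z) for x, y, z in world_points]


def polygon_on_plane(xy_vertices, p, q, c):
    """Lift (X, Y) vertices onto the world plane Z = pX + qY + c."""
    return [(x, y, p * x + q * y + c) for x, y in xy_vertices]


def perspective_area_ratio(xy_vertices, p, q, c, r, l):
    """Ratio of right-image area to left-image area under perspective."""
    world = polygon_on_plane(xy_vertices, p, q, c)
    left = shoelace_area(perspective_projection(world, (0.0, 0.0)))
    right = shoelace_area(perspective_projection(world, (r, l)))
    return right / left


if __name__ == "__main__":
    vertices = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (0.5, 1.5), (-0.5, 0.8)]
    p, q, c, r, l = 0.2, -0.1, 5.0, 0.5, 0.3
    print(perspective_area_ratio(vertices, p, q, c, r, l))
    print(1.0 + (r * p + l * q) / c)
```

Both printed values should agree to rounding error for any polygon whose vertices all have positive depth.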




APPENDIX  2 


Theorem: 

With the nomenclature of the previous theorem
(Appendix 1) and assuming only horizontal displacement
of the cameras, if $S^L$, $S^R$ denote the areas of the left and
right images, and the world plane from which the image
contour is obtained is $Z = pX + qY + c$, then

$$\frac{S^L}{S^R} = \frac{1}{1 + \dfrac{d_x}{c}\,p}$$

($d_x$ = displacement between camera 1 and camera 2).

Proof: 


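$$S^L = \frac{1}{2}\Big(\sum_i x_i^L y_{i+1}^L - \sum_i x_{i+1}^L y_i^L\Big), \qquad S^R = \frac{1}{2}\Big(\sum_i x_i^R y_{i+1}^R - \sum_i x_{i+1}^R y_i^R\Big) \qquad \text{(area of a polygon)}$$

where $(x^L, y^L)$, $(x^R, y^R)$ are the coordinates in the left
and right image planes, respectively.

Now

$$y_i^L = y_i^R \qquad (1)$$

and

$$x_i^R = \frac{f(X_i - d_x)}{Z_i} = x_i^L - \frac{f d_x}{Z_i}$$

so that

$$2S^R = 2S^L + d_x\Big(\sum_i y_i^L \frac{f}{Z_{i+1}} - \sum_i y_{i+1}^L \frac{f}{Z_i}\Big)$$

From the plane equation,

$$Z_i = pX_i + qY_i + c \;\Rightarrow\; \frac{c}{Z_i} = 1 - p\,\frac{x_i^L}{f} - q\,\frac{y_i^L}{f} \;\Rightarrow\; \frac{fc}{Z_i} = f - p\,x_i^L - q\,y_i^L$$

Therefore

$$\sum_i y_i^L \frac{fc}{Z_{i+1}} - \sum_i y_{i+1}^L \frac{fc}{Z_i} = \sum_i y_i^L\,(f - p\,x_{i+1}^L - q\,y_{i+1}^L) - \sum_i y_{i+1}^L\,(f - p\,x_i^L - q\,y_i^L)$$

$$= f\Big(\sum_i y_i^L - \sum_i y_{i+1}^L\Big) + \Big(\sum_i p\,x_i^L y_{i+1}^L - \sum_i p\,x_{i+1}^L y_i^L\Big)$$

Since $\sum_i y_i^L = \sum_i y_{i+1}^L$ (as $y_{n+1} = y_1$ by notation), this is

$$= \sum_i (p\,x_i^L)\,y_{i+1}^L - \sum_i (p\,x_{i+1}^L)\,y_i^L = 2p\,S^L$$

Hence

$$S^R = S^L + \frac{d_x}{c}\,p\,S^L, \qquad \text{i.e.} \qquad \frac{S^L}{S^R} = \frac{\sum_i x_i^L y_{i+1}^L - \sum_i x_{i+1}^L y_i^L}{\sum_i x_i^R y_{i+1}^R - \sum_i x_{i+1}^R y_i^R} = \frac{1}{1 + \dfrac{d_x}{c}\,p}$$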


Thus  we  have  proved  the  theorem. 



The tracking system must use velocities that stem from
the object being tracked and ignore background velocities.
(a) Shows an initial situation where a target is moving
on the retina. (b) Once the tracking system is
engaged, the target is moving with a relative velocity
near zero but the background has a large signal.


Figure  5.2  Target  Identification 


Figure  5.4  Imaging  Geometry  and  Motion  Representation 




The global velocity space registers the number of flow
vectors with certain values. Channels, shown in the
figure as concentric annular regions, allow ranges of
velocities to be selectively ignored. (a) Initially the low
velocity channel is off (shaded), allowing the system to
selectively register a moving target (•) and ignore background
variations (o). (b) Once the tracking mechanism is
activated the high velocity channels are blocked and
again the target velocities are passed, ignoring the
background signals.

Figure  5.3  Concept  of  the  Velocity  Channel 


Velocity in depth of S is v = (0, 0, W)


Figure  5.5  Monocular  Tracking 




AN  INTEGRATED  SYSTEM  THAT  UNIFIES  MULTIPLE  SHAPE  FROM  TEXTURE  ALGORITHMS 


Mark L. Moerdler and John R. Kender¹


Department of Computer Science, Columbia University, New York, N.Y. 10027


ABSTRACT 

This paper describes an approach which intelligently integrates
several conflicting and corroborating shape-from-texture methods in
a single system. The system uses a new data structure, augmented
texels, which combines multiple constraints on orientation in a
compact notation for a single surface patch. The augmented texels
initially store weighted orientation constraints that are generated by
the system's several independent shape-from-texture components.
These texture components, which run autonomously and may run in
parallel, derive constraints by any of the currently existing
shape-from-texture approaches, e.g. shape-from-uniform-texel-spacing.
For each surface patch the augmented texel then combines the
potentially inconsistent orientation data, using a Hough
transform-like method on a tessellated Gaussian sphere, resulting in
an estimate of the most likely orientation for the patch. The system
then defines which patches are part of the same surface, simplifying
surface reconstruction.

This  knowledge  fusion  approach  is  illustrated  by  a  system  that 
integrates  information  from  two  different  shape-from-texture 
methods,  shape-from-uniform-texel-spacing  and  shape-from- 
uniform-texel-size.  The  system  is  demonstrated  on  camera  images 
of  artificial  and  natural  textures. 


1  INTRODUCTION 

This paper proposes a new approach to the problem of
defining and reconstructing surfaces based on multiple independent
textural cues. The generality of this approach is due to the interaction
between textural cues, allowing the methodology to extract shape
information from a wider range of textured surfaces than any
individual method. The method, as shown in figure 1, consists of
three major phases: the calculation of orientation constraints and the
generation of texel patches², the consolidation of constraints into a
"most likely" orientation per patch, and finally the reconstruction of
the surface.

During the first phase the different shape-from-texture
components generate texel patches and augmented texels. Each
augmented texel consists of the 2-D description of the texel patch
and a list of weighted orientation constraints for the patch. The
orientation constraints for each patch are potentially inconsistent or
incorrect because the shape-from methods are locally based and
utilize an unsegmented, noisy image.

¹This research was supported in part by ARPA grant N00039-84-C-0165, by a
NSF Presidential Young Investigator Award, and by Faculty Development Awards
from AT&T, Ford Motor Co., and Digital Equipment Corporation.

²A texel patch is a 2-D description of a subimage that contains one or more
textural elements. The number of elements that compose a patch is dependent on the
shape-from-texture algorithm.


Figure  1:  Integrating  multiple  shape-from  methods 

In the second phase, all the orientation constraints for each
augmented texel are consolidated into a single "most likely"
orientation by a Hough-like transformation on a tessellated Gaussian
Sphere. During this phase the system will also merge together all
augmented texels that cover the same area of the image. This is
necessary because some of the shape-from components define
"texel" similarly, and the constraints generated should also be
merged.

Finally, the system re-analyzes the orientation constraints to
determine which augmented texels are part of the same constraint
family and groups them together. In effect, this segments the image
into regions of similar orientation. The surface can then be
reconstructed from these surface patches.

The  robustness  of  this  approach  is  illustrated  by  a  system  that 
fuses  the  orientation  constraints  of  two  existing  shape-from  methods: 
shape-from-uniform-texel-spacing [Moerdler and Kender 85], and
shape-from-uniform-texel-size [Ohta et al. 81]. These two methods
generate  orientation  constraints  for  different  overlapping  classes  of 
textures. 


2 HISTORICAL BACKGROUND

Current methods to derive shape-from-texture are based on
measuring a distortion³ that occurs when a textured surface is viewed
under perspective. This perspective distortion is imaged as a change
in some aspect of the texture. In order to simplify the recovery of the
orientation parameters from this distortion, researchers have imposed
limitations on the applicable class of textured surfaces. Some of the
limiting assumptions include uniform texel spacing [Kender 80;
Kender 83; Moerdler and Kender 85], uniform texel size [Ikeuchi 80;
Ohta et al. 81; Aloimonos and Swain 85], uniform texel density
[Aloimonos 86], and texel isotropy [Witkin 80; Davis, Janos and
Dunn 83; Dunn 84]. Each of these is a strong limitation, causing
methods based on it to be applicable to only a limited range of real
images.

³Under the assumption that natural texture does not mimic projective effects nor
does it cancel those effects out.


3 DESIGN METHODOLOGY

The generation of orientation constraints from perspective
distortion uses one or more image texels. The orientation constraints
can be considered as local, defining the orientation of individual
surface patches (called texel patches⁴) each of which covers a texel
or group of texels. This definition allows a simple extension to the
existing shape-from methods beyond their current limitation of
planar surfaces or simple non-planar surfaces based on a single
textural cue. The problem can then be considered as one of
intelligently fusing the orientation constraints per patch. Ikeuchi
[Ikeuchi 80] and Aloimonos [Aloimonos and Swain 85] attempt a
similar extension based on constraint propagation and relaxation for
planar and non-planar surfaces, using only a single
shape-from-texture method.

The process of fusing orientation constraints and generating
surfaces can be broken down into the following three phases:

1. The creation of texel patches and multiple orientation
constraints for each patch.

2. The unification of the orientation constraints per patch
into a "most likely" orientation.

3. The formation of surfaces from the texel patches.

Each of the remaining subsections of this chapter describes one of
these phases.

⁴Texel patches are defined by how each method utilizes the texels. Some methods
(e.g. uniform texel size) use a measured change between two texels; in this case the
texel patches are the texels themselves. Other methods (e.g. uniform texel
density) use a change between two areas of the image; in this case the texel patches
are these predefined areas.


3.1 SURFACE PATCH AND ORIENTATION
CONSTRAINT GENERATION

The first phase of the system consists of multiple
shape-from-texture components which generate augmented texels. Each
augmented texel consists of a texel patch, orientation constraints
for the texel patch, and an assurity weighting per constraint. The
orientation constraints are stored in the augmented texel as vanishing
points, which are mathematically equivalent to a class of other
orientation notations (e.g. Tilt and Pan constraints) [Shafer, Kanade
and Kender 83]. Moreover, they are simple to generate and compact
to store.

The assurity weighting is defined separately for each
shape-from method and is based upon the intrinsic error of the method. For
example, shape-from-uniform-texel-spacing's assurity weighting is a
function of the total distance between the texel patches used to
generate that constraint. A low assurity value is given when the
inter-texel distance is small (~3 pixels) because under these
conditions a small digitization error causes a large orientation error.
As the inter-texel distance increases the assurity value also increases.
At a heuristic threshold, it starts to decrease again, based on the fact
that once the inter-texel distance grows too large the local surface is
no longer approximated by a plane and the orientation error grows.
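
As an illustration of the augmented texel and of an assurity weighting with the rise-then-fall shape described for shape-from-uniform-texel-spacing, consider the sketch below. The paper gives neither field names nor numeric thresholds, so the class layout, the piecewise-linear form, and the values `low`, `peak`, `high` and `floor` are all assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OrientationConstraint:
    vanishing_point: Tuple[float, float]  # constraint stored as a vanishing point
    weight: float                         # assurity weighting of this constraint
    source: str                           # shape-from method that produced it


@dataclass
class AugmentedTexel:
    patch: List[Tuple[int, int]]          # 2-D description of the texel patch (pixels here)
    constraints: List[OrientationConstraint] = field(default_factory=list)


def spacing_assurity(distance, low=3.0, peak=40.0, high=120.0, floor=0.05):
    """Hypothetical assurity weight for a constraint built from texels that
    are `distance` pixels apart: low near `low`, rising to 1.0 at the
    heuristic threshold `peak`, then falling back toward `floor` at `high`."""
    if distance <= low or distance >= high:
        return floor
    if distance <= peak:
        rising = (distance - low) / (peak - low)
        return floor + (1.0 - floor) * rising
    falling = (high - distance) / (high - peak)
    return floor + (1.0 - floor) * falling


if __name__ == "__main__":
    texel = AugmentedTexel(patch=[(10, 12), (10, 13)])
    texel.constraints.append(
        OrientationConstraint((250.0, -40.0), spacing_assurity(25.0),
                              "uniform-texel-spacing"))
    print(texel.constraints[0].weight)
```

A smooth bump would serve equally well; the only property taken from the text is that very small and very large inter-texel distances receive low weight.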


3.2  MOST  LIKELY  ORIENTATION  GENERATION 

Once the orientation constraints have been generated for each
augmented texel, the next step consists of unifying the constraints
into one orientation per augmented texel. The major difficulty in
deriving this "most likely" orientation is that the constraints are
errorful, inconsistent, and potentially incorrect. A simple and
computationally feasible solution to this is to use a Gaussian Sphere,
which maps the orientation constraints to points on the sphere
[Shafer, Kanade and Kender 83]. A single vanishing point
circumscribes a great circle on the Gaussian Sphere; two different
constraints generate two great circles that overlap at two points,
uniquely defining the orientation of both the visible and invisible
sides of the surface patch.
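
The geometry of two intersecting great circles can be sketched in a few lines. The example assumes a pinhole camera with image coordinates centered on the optical axis and a focal length `focal_length` (neither is specified in the text): a vanishing point then gives a 3-D direction lying in the surface, every normal perpendicular to that direction lies on one great circle, and two such circles meet at the two antipodal points given by the cross product of the directions. Function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def direction_from_vanishing_point(vp, focal_length=1.0):
    """Unit 3-D direction whose image (under the assumed pinhole model)
    is the vanishing point vp = (x, y)."""
    x, y = vp
    norm = math.sqrt(x * x + y * y + focal_length * focal_length)
    return (x / norm, y / norm, focal_length / norm)


def normals_from_two_vanishing_points(vp1, vp2, focal_length=1.0):
    """Return the two antipodal unit normals where the great circles of
    the two constraints intersect."""
    d1 = direction_from_vanishing_point(vp1, focal_length)
    d2 = direction_from_vanishing_point(vp2, focal_length)
    n = cross(d1, d2)
    length = math.sqrt(dot(n, n))
    if length < 1e-12:
        raise ValueError("the two constraints describe the same great circle")
    n = tuple(c / length for c in n)
    return n, tuple(-c for c in n)


if __name__ == "__main__":
    visible, hidden = normals_from_two_vanishing_points((1.0, 0.0), (0.0, 1.0))
    print(visible, hidden)
```

With noisy constraints the circles rarely meet in exactly one pair of points, which is why the system accumulates evidence on a tessellated sphere instead of intersecting circles pairwise.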

The Gaussian sphere is approximated, within the system, by
a hierarchically tessellated Gaussian Sphere based on trixels
(triangular shaped faces; see figure 2) [Ballard and Brown 82;
Fekete and Davis 84; Korn and Dyer 86]. The top level of the
hierarchy is the icosahedron. At each level other than the lowest
level of the hierarchy, each trixel has four children. This hierarchical
methodology allows the user to specify the accuracy to which the
orientation can be calculated by defining the number of levels of
tessellation that are created.

The system generates the "most likely" orientation for each
texel patch by accumulating evidence for all the constraints for the
patch. For each constraint, it recursively visits each trixel to check if
the constraint's great circle falls on the trixel, and then visits the
children if the result is positive. At each leaf trixel the likelihood
value of the trixel is incremented by the constraint's weight. The
hierarchical nature of this approach limits the number of trixels that
need to be visited.
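
The hierarchical accumulation can be sketched as follows. This is an illustrative reading of the description, not the authors' code: each great circle is represented by its pole (the unit direction perpendicular to the circle), a trixel is taken to be crossed when its three corners do not all lie strictly on one side of the circle, and children are formed by splitting each triangle at its edge midpoints. The class and function names are hypothetical.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def midpoint_on_sphere(a, b):
    return normalize(tuple((x + y) / 2.0 for x, y in zip(a, b)))


def icosahedron():
    """Unit-sphere vertices and the 20 triangular faces of an icosahedron."""
    t = (1.0 + math.sqrt(5.0)) / 2.0
    raw = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
           (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
           (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [normalize(v) for v in raw], faces


class Trixel:
    def __init__(self, corners, depth, levels):
        self.corners = corners
        self.likelihood = 0.0
        self.children = []
        if depth < levels:
            a, b, c = corners
            ab, bc, ca = (midpoint_on_sphere(a, b), midpoint_on_sphere(b, c),
                          midpoint_on_sphere(c, a))
            for child in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
                self.children.append(Trixel(child, depth + 1, levels))

    def crossed_by(self, pole):
        # The circle {n : n . pole = 0} meets the trixel unless all three
        # corners are strictly on the same side of it.
        dots = [dot(pole, corner) for corner in self.corners]
        return min(dots) <= 0.0 <= max(dots)

    def accumulate(self, pole, weight):
        if not self.crossed_by(pole):
            return                      # prune this whole subtree
        if not self.children:
            self.likelihood += weight   # leaf: add the constraint's weight
            return
        for child in self.children:
            child.accumulate(pole, weight)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_sphere(levels):
    vertices, faces = icosahedron()
    return [Trixel(tuple(vertices[i] for i in face), 0, levels) for face in faces]


def accumulate(sphere, pole, weight):
    for root in sphere:
        root.accumulate(pole, weight)


def all_leaves(sphere):
    return [leaf for root in sphere for leaf in root.leaves()]
```

With `levels = 2` the sphere has 20 x 4 x 4 = 320 leaf trixels; each extra level quarters the angular size of a leaf, which is the accuracy control mentioned in the text.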

Once all of the constraints for a texel patch have been
considered, a peak finding program smears the likelihood values at
the leaves. Currently, this is done heuristically: the smeared value of
each leaf is equal to 1/2 the value of its neighbors plus 1/4 the value
of all its neighbors' neighbors that are not neighbors of the leaf. This is
a rough approximation to a Gaussian blur. The "most likely"
orientation is defined to be the trixel with the largest smeared value.
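
A minimal sketch of the smearing and peak-finding step is given below. It works on any adjacency map between leaves, so it does not depend on how the trixels are stored. The text lists only the neighbor terms; keeping the leaf's own value at full weight is an assumption made here so that the result behaves like a blur, and the function names are hypothetical.

```python
def smear(likelihood, neighbors):
    """Blur leaf likelihoods: own value (assumed weight 1), plus 1/2 of each
    neighbor, plus 1/4 of each neighbor's neighbor that is not itself a
    neighbor of the leaf."""
    smeared = {}
    for leaf, value in likelihood.items():
        first_ring = set(neighbors[leaf])
        second_ring = set()
        for n in first_ring:
            second_ring.update(neighbors[n])
        second_ring -= first_ring
        second_ring.discard(leaf)
        smeared[leaf] = (value
                         + 0.5 * sum(likelihood[n] for n in first_ring)
                         + 0.25 * sum(likelihood[n] for n in second_ring))
    return smeared


def most_likely_leaf(likelihood, neighbors):
    """Leaf with the largest smeared value."""
    smeared = smear(likelihood, neighbors)
    return max(smeared, key=smeared.get)


if __name__ == "__main__":
    # Five leaves in a chain a-b-c-d-e, with all the evidence on b.
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
                 "d": ["c", "e"], "e": ["d"]}
    likelihood = {"a": 0.0, "b": 4.0, "c": 0.0, "d": 0.0, "e": 0.0}
    print(smear(likelihood, neighbors))
    print(most_likely_leaf(likelihood, neighbors))
```

On a trixelated sphere the adjacency map would link trixels that share an edge; that choice of neighborhood is likewise not stated in the text.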




Figure  2:  The  Trixelated  Gaussian  Sphere 


3.3 SURFACE GENERATION

The  final  phase  of  the  system  generates  surfaces  from  the 
individual  augmented  texels.  This  is  done  by  re-analyzing  the 
orientation  constraints  generated  by  the  shape-from  methods  in  order 
to  determine  which  augmented  texels  are  part  of  the  same  surface.  In 
doing  this  the  surface  generation  is  also  performing  a  first 
approximation  of  surface  separation  and  segmentation. 

The re-analysis consists of iterating through each augmented
texel, considering all its orientation constraints and determining
which constraints aided in defining the "correct" orientation for the
texel patch as described in phase two. If an orientation constraint
correctly determined the orientation of all the texels that were used in
generating the constraint, then these augmented texels are considered
as part of the same surface.
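
One way to realize this grouping is with a union-find structure, sketched below. The paper does not say how the grouping is implemented, so union-find is a substituted technique; the tolerance, the representation of a constraint as a pole plus the list of texels that produced it, and all names are assumptions.

```python
def constraint_supports(pole, normal, tolerance=0.05):
    """True when a texel's chosen normal lies (nearly) on the constraint's
    great circle, i.e. is nearly perpendicular to the circle's pole."""
    return abs(sum(p * n for p, n in zip(pole, normal))) <= tolerance


def group_into_surfaces(normals, constraints, tolerance=0.05):
    """normals: {texel_id: unit normal chosen in phase two}
    constraints: list of (pole, [texel_ids used to generate the constraint])
    Returns a sorted list of texel-id groups, one group per surface."""
    parent = {texel: texel for texel in normals}

    def find(texel):
        while parent[texel] != texel:
            parent[texel] = parent[parent[texel]]   # path halving
            texel = parent[texel]
        return texel

    for pole, texels in constraints:
        # Merge only if the constraint agrees with every texel that produced it.
        if all(constraint_supports(pole, normals[t], tolerance) for t in texels):
            root = find(texels[0])
            for other in texels[1:]:
                parent[find(other)] = root

    groups = {}
    for texel in normals:
        groups.setdefault(find(texel), []).append(texel)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    normals = {0: (0.0, 0.0, 1.0), 1: (0.0, 0.0, 1.0),
               2: (0.0, 0.0, 1.0), 3: (1.0, 0.0, 0.0)}
    constraints = [((1.0, 0.0, 0.0), [0, 1]),
                   ((0.0, 1.0, 0.0), [1, 2]),
                   ((1.0, 0.0, 0.0), [2, 3])]
    print(group_into_surfaces(normals, constraints))
```

In the example the third constraint disagrees with texel 3, so texel 3 is left as a separate surface, much as isolated noise texels are in the experiments.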

Once it is determined which augmented texels are part of the
same surface, the surfaces are generated by a simple planar surface
generation algorithm. The surfaces are defined as the best fit
approximation that contains all the related texel patches. This
approach allows both the generation and the separation of surfaces
that are connected or overlapping.


4  TEST  DOMAIN 

The knowledge fusion approach outlined in the previous
section has been applied to a test system that contains two
shape-from-texture methods, shape-from-uniform-texel-spacing [Moerdler
and Kender 85] and shape-from-uniform-texel-size [Ohta et al. 81].
Each of the methods is applicable to a different, limited type of
texture. Shape-from-uniform-texel-spacing
derives orientation constraints based on the assumption that the
texels on the surface are of arbitrary shape but are equally spaced,
while shape-from-uniform-texel-size is based on the unrelated
criterion that the spacing between texels can be arbitrary but the sizes
of all of the texels are equal, though unknown.

In shape-from-uniform-texel-size, if the distance from the
center of mass of texel $T_1$ to texel $T_2$ (see figure 3) is defined as D,
then the distance from the center of texel $T_2$ to a point on the
vanishing line can be written as:

$$F_2 = \frac{D \times S_2^{1/3}}{S_1^{1/3} - S_2^{1/3}}$$

where $S_1$ and $S_2$ denote the image areas of texels $T_1$ and $T_2$.


Figure  3:  The  calculation  of  shape-from-uniform-texel-size 
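
The uniform-texel-size relation $F_2 = D \times S_2^{1/3} / (S_1^{1/3} - S_2^{1/3})$ is simple enough to evaluate directly. The sketch below assumes that $S_1$ and $S_2$ are the image areas of the two texels and that $T_1$ is the larger (nearer) one; the function name is hypothetical.

```python
def distance_to_vanishing_line(d, s1, s2):
    """Distance F2 from texel T2 to the vanishing line, given the image
    distance d between the texels and their image areas s1 > s2 > 0."""
    r1 = s1 ** (1.0 / 3.0)
    r2 = s2 ** (1.0 / 3.0)
    if r1 == r2:
        raise ValueError("equal areas: the vanishing line is at infinity")
    return d * r2 / (r1 - r2)


if __name__ == "__main__":
    # Areas 27 and 8 have cube roots 3 and 2, so F2 = 10 * 2 / (3 - 2).
    print(distance_to_vanishing_line(10.0, 27.0, 8.0))
```

In this example F2 is about 20, so T1 lies about 30 from the vanishing line, and (30/20)^3 = 27/8 reproduces the area ratio, which is the cube-law assumption behind the formula.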

In shape-from-uniform-texel-spacing the calculations are
similar. Given any two texels $T_1$ and $T_2$ (see figure 4) whose
relative positions are $P_1$ and $P_2$, if the distance from $T_1$ to the
mid-texel $T_3$ is equal to L and the distance from $T_2$ to the same
mid-texel $T_3$ is equal to R, the vanishing point distance X is given by:

$$X = \frac{L \times P_1 - R \times P_2}{L - R}$$


Figure  4:  A  geometrical  representation  of  back-projecting. 

Under certain conditions either method may generate incorrect
constraints, which are ignored. On textures that are solvable by both
methods, the methods cooperate and correctly define the textured
surface or surfaces in the image. Some images are not solvable by
either method by itself, but can be correctly segmented and the
surfaces defined only by the interaction of the cues (i.e. the upper
right texel of figure 15).


5  THE  EFFECTS  OF  NOISE 

The real, camera generated, images contain noise and
shadows which are effectively ignored by the system in many cases.
The system treats shadows as potential surface texels (see texels 9
and 13 in figure 5) and uses them to compute orientation constraints.
Since many texels are used in generating the orientation for each
individual texel, the effect of shadow texels is minimized⁵.

Noise can occur in many ways: it can create texels, and it can
change the shape, size, or position of texels. If noise texels are
sufficiently small then they are ignored in the texel finding
components of the shape-from methods. When they are large, they
are treated in much the same way as shadow texels and thus often do
not affect the orientation of the surface texel patches. Since many
texels are used and more than one shape-from method is employed,
noise-created changes in the shape of texels can perturb the
orientation results, but the effect appears negligible as shown in the
experimental results.

6  EXPERIMENTAL  RESULTS 

The  system  has  been  tested  over  a  range  of  both  synthetic  and 
natural  textured  surfaces,  and  appears  to  show  robustness  and 
generality.  Three  examples  are  given  on  real,  noisy  images  that 
demonstrate  the  cooperation  among  the  shape-from  methods. 

The  first  image,  figure  5,  shows  a  real  image  of  a  man-made 
texture  consisting  of  equally  spaced,  equally  sized  circles.  The 
system  finds  fifteen  texels:  the  twelve  texels  on  the  surface,  plus 
three  noise  texels  located  in  the  background.  It  is  able  to  generate  the 
correct  q  value  for  each  of  the  twelve  surface  texels  and  a  p  value 
that  is  off  by  2  degrees  (see  figure  1  for  the  positions  of  the  texels 
and  figure  9  for  the  individual  p  and  q  values.)  In  figure  8  the 
orientations  of  the  texel  patches  are  displayed  as  needle-like  surface 
normal  vectors. 

The system is also able to segment the image into
surfaces, one of which contains only the twelve correct surface
texels. The noise generated texels are each individually marked as
part of separate surfaces.

The  second  example  consists  of  a  surface  textured  with  twelve 
uniformly  sized  coins  (see  figure  10).  The  system  in  this  case 
generates  p  and  q  values  that  are  approximately  correct  (two  degrees 
for  the  p  and  exactly  correct  for  the  q.  See  figure  13)  for  ten  of  the 
twelve  texel  patches.  The  other  two  texels  have  a  greater  error  (five 
degrees  for  the  p  and  correct  for  the  q)  due  to  digitization  errors  (see 
figure  14).  The  constraints  generated  by  the  shape-from-uniform- 
texel-spacing  algorithm  are  ignored. 

The final example is an image of a box of breakfast buns in
which one bun is missing (see figure 15). Eleven texels are found,
eight of which are patches of filling on the surface of buns; two are
noisy data on the buns; and one is part of the packaging. The texels
are only approximately evenly spaced and approximately evenly
sized.

⁵Even under the conditions where many shadow texels are found they do not
affect the computed orientation of surface texels so long as the placement of the
shadow texels does not mimic perspective distortion.


Figure 5: Surface textured with evenly spaced evenly sized circles

The system is able to generate approximately correct surface
normals (see figure 16) for all but one of the texels on the breakfast
buns and an approximate orientation for the outside of the package.
The upper left hand texel contains two equally weighted orientations.
Without the shape-from-uniform-texel-spacing method the correct
orientation would not have been generated for this patch.


Figure  6:  The  texels  found  for  the  evenly  spaced  circles  texture 


7  CONCLUSION  AND  FUTURE  RESEARCH 

In  this  paper  a  system  has  been  proposed  that  can  fuse  the 
results  of  a  number  of  shape-from-texture  methodologies  to  generate 
surface  segments  and  their  orientations.  The  system  has  been  tested 
using  two  existing  shape-from  methods,  shape-from-uniform-texel- 
spacing  and  shape-from-uniform-texel-size  and  has  shown  the  ability 
to  recover  the  surfaces  and  their  orientation  under  noisy  conditions. 




The robustness of the system has been exercised using images
that contain multiple surfaces, surfaces that are solvable by either
method alone, and finally surfaces that are solvable only by using
both methods together.

Future  enhancements  to  the  system  would  include  the  addition 
of  other  shape-from-texture  modules,  the  investigation  of  other 
means  of  fusing  information  (such  as  object  model  approaches),  and 
optimization  of  the  method,  especially  in  a  parallel  processing 
environment. 


Figure  10:  An  image  of  a  surface  covered  with  coins 


Figure  8:  Surface  normals  for  the  surface  textured  with  circles 


Texel Number   Measured p & q        Actual p & q          Error
0 to 8         p = 2.6,  q = 0.0     p = 3.0, q = 0.0      2° (p), 0° (q)
9              p = 11.0, q = 6.7     Shadow Texel
10 to 12       p = 2.6,  q = 0.0     p = 3.0, q = 0.0      2° (p), 0° (q)
13             p = 11.0, q = 6.7     Shadow Texel


Figure  9:  Orientation  values  for  the  surface  textured  with  circles 




Figure  12:  The  texel  numbers  of  the  coins 
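
Texel Number   Measured p & q        Actual p & q          Error
(illegible)    p = 5.5, q = 0.0      p = 3.0, q = 0.0      5° (p), 0° (q)
(illegible)    p = 2.6, q = 0.0      p = 3.0, q = 0.0      2° (p), 0° (q)
(illegible)    p = 5.5, q = 0.0      p = 3.0, q = 0.0      5° (p), 0° (q)
(illegible)    p = 2.6, q = 0.0      p = 3.0, q = 0.0      2° (p), 0° (q)
(illegible)    p = 5.5, q = 0.0      p = 3.0, q = 0.0      5° (p), 0° (q)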

Figure  13:  Orientation  values  for  the  surface  textured  with  coins 


Figure  14:  surface  normals  generated  for  the  coins 


Figure  15:  A  box  of  breakfast  buns  with  one  bun  missing 


Figure  16:  The  surface  normals  generated  for  the  box  of  buns 




References 


[Aloimonos 86]
John Aloimonos.
Detection of Surface Orientation and Motion from Texture: 1. The Case of Planes.
In Proceedings of Computer Vision Pattern Recognition Conference.
IEEE Computer Society, 1986.

[Aloimonos and Swain 85]
John Aloimonos and Michael J. Swain.
Shape from Texture.
In Proceedings of the Tenth International Joint Conference on
Artificial Intelligence. IJCAI, 1985.

[Ballard and Brown 82]
Dana Ballard and Christopher Brown.
Computer Vision.
Prentice-Hall Inc., 1982.

[Davis, Janos and Dunn 83]
Larry S. Davis, Ludvik Janos, and Stanley M. Dunn.
Efficient Recovery of Shape from Texture.
IEEE Transactions on Pattern Analysis and Machine Intelligence
PAMI-5(5):485-492, September 1983.

[Dunn 84]
Stanley M. Dunn, Larry S. Davis, and Hannu A. Hakalahti.
Experiments in Recovering Surface Orientation from Texture.
Technical Report CAR-TR-61, University of Maryland Center For
Automation Research, May 1984.

[Fekete and Davis 84]
Gyorgy Fekete and Larry S. Davis.
Property Spheres: A New Representation For 3-D Object Recognition.
Proceedings of the Workshop on Computer Vision Representation and
Control: 192-201, 1984.

[Gibson 50]
James J. Gibson.
Perception of the Visual World.
Riverside Press, 1950.

[Ikeuchi 80]
Katsushi Ikeuchi.
Shape from Regular Patterns (an Example from Constraint Propagation
in Vision).
Proceedings of the International Conference on Pattern Recognition:
1032-1039, December 1980.

[Kanatani and Chou 86]
Ken-ichi Kanatani and Tsai-Chia Chou.
Shape from Texture: General Principle.
In Proceedings of Computer Vision Pattern Recognition Conference.
IEEE Computer Society, 1986.

[Kender 80]
John R. Kender.
Shape from Texture.
PhD thesis, C.M.U., 1980.

[Kender 83]
John R. Kender.
Surface Constraints from Linear Extents.
Proceedings of the National Conference on Artificial Intelligence,
March 1983.

[Korn and Dyer 86]
Matthew R. Korn and Charles R. Dyer.
3-D Multiview Object Representation for Model-Based Object Recognition.
Technical Report RC 11760, IBM T.J. Watson Research Center, March 86.

[Moerdler and Kender 85]
Mark L. Moerdler and John R. Kender.
Surface Orientation and Segmentation from Perspective Views of
Parallel-Line Textures.
Technical Report, Columbia University, 1985.

[Ohta et al. 81]
Yu-ichi Ohta, Kiyoshi Maenobu, and Toshiyuki Sakai.
Obtaining Surface Orientation from Texels under Perspective Projection.
In Proceedings of the Seventh International Joint Conference on
Artificial Intelligence. IJCAI, 1981.

[Shafer, Kanade and Kender 83]
Steven A. Shafer, Takeo Kanade, and John R. Kender.
Gradient Space under Orthography and Perspective.
Computer Vision, Graphics and Image Processing (24), 1983.

[Stevens 79]
Kent A. Stevens.
Surface Perception from Local Analysis of Texture and Contour.
PhD thesis, Massachusetts Institute of Technology, 1979.

[Tsai 86]
Roger Y. Tsai.
An Efficient and Accurate Camera Calibration Technique for 3D
Machine Vision.
In Proceedings of Computer Vision Pattern Recognition Conference.
IEEE Computer Society, 1986.

[Witkin 80]
Andrew P. Witkin.
Recovering Surface Shape from Orientation and Texture.
In Michael Brady (editor), Computer Vision, pages 17-45.
North-Holland Publishing Company, 1980.




DRIVE  -  Dynamic  Reasoning  from  Integrated  Visual  Evidence 


Bir  Bhanu  and  Wilhelm  Burger 


Honeywell  Systems  &  Research  Center 
3660  Technology  Drive,  Minneapolis,  MN  55418 


ABSTRACT 

The vision system of an Autonomous Land Vehicle is
required to handle complex dynamic scenes. Vehicle motion
and individually moving objects in the field of view
contribute to a continuously changing camera image. The goal of
"motion understanding" is to find consistent
three-dimensional interpretations for those changes in the image
sequence. The estimation of vehicle motion, the detection of
moving objects in the scene and the description of their
movements are the key issues. We present a new approach
to this problem, which departs from previous work by
emphasizing a qualitative line of reasoning and modeling,
where multiple interpretations of the scene are pursued
simultaneously. Different sources of visual information are
integrated in a rule-based framework to construct and
maintain a vehicle-centered three-dimensional model of the scene.
It is shown that this approach offers significant advantages
over "hard" numerical techniques which have been proposed
in the motion understanding literature.


1.  INTRODUCTION 

Visual  information  from  the  environment  is  an 
indispensable  clue  for  the  operation  of  the  Autonomous 
Land  Vehicle  (ALV).  Even  with  the  use  of  sophisticated 
inertial  navigation  systems,  the  accumulation  of  position 
errors  requires  periodic  corrections.  Mission  tasks  involving 
search  and  rescue,  exloration  and  manipulation  (e.g.  refuel¬ 
ing  and  munitions  deployment)  critically  depend  on  visual 
information.  Motion  becomes  a  natural  component  as  soon 
as  moving  objects  are  encountered  in  some  form,  while  fol¬ 
lowing  a  convoy,  approaching  other  vehicles  or  detecting 
threats.  The  presence  of  moving  objects  and  their  behavior 
must  be  known  to  provide  appropriate  counteraction.  In 
addition  to  that,  image  motion  provides  important  cues  about 
the  spatial  layout  of  the  environment  and  about  the  actual 
movements  of  the  vehicle.  As  part  of  the  vehicle  control 
loop, visual motion feedback is beneficial for path stabilization,
steering and braking. Results from psychophysics [9,16]
show that humans rely heavily on visual motion for motor
control.

While the vehicle itself is moving, the entire camera
image changes continuously. The interpretation of complex
dynamic scenes is therefore a continuous task for the
vision system of an autonomous land vehicle. Any change
in the 2-D image is the result of a change in 3-D
space, induced either by vehicle motion or by moving
objects. Finding a consistent interpretation for every change
in the image is the objective of "motion understanding".

Previous  work  in  motion  analysis  has  mainly  concen¬ 
trated  on  numerical  approaches  for  reconstructing  motion 
and scene structure from image sequences. Recently Nagel [13]
has  given  a  comprehensive  review.  While  a  completely  sta¬ 
tionary  environment  has  been  assumed  in  most  previous 
work  on  the  reconstruction  of  camera  motion,  the  possible 
presence  of  moving  objects  must  be  accounted  for  in  this 
scenario.  Similarly,  we  cannot  rely  on  a  fixed  camera  setup 
to  detect  those  moving  objects.  Clearly,  some  kind  of  com¬ 
mon  reference  is  required,  against  which  the  movement  of 
the  vehicle  as  well  as  the  movement  of  objects  in  the  scene 
can  be  related. 

Extensive  work  has  been  done  in  determining  the  rela¬ 
tive  motion  and  rigid  3-D  structure  of  a  given  set  of  image 
points,  basically  following  two  approaches.  In  the  first 
approach,  structure  and  motion  are  computed  in  one  integral 
step by solving a system of linear or nonlinear equations
[11,12,19,23] from a minimum number of points on a
rigid object. The method is reportedly sensitive to noise [4,22].
Recent work [2,3,6,18,21] has addressed the problem of recovering
and refining 3-D structure from motion over extended
periods of time, demonstrating that fairly robust results can
be obtained. However, these approaches require distinct
views of the object (the environment), which are generally
not available to a moving vehicle. Moreover, it seems
that the noise problem cannot be overcome simply by
increasing the time of observation.

The second approach [2,7,9,10,14,17] makes use of the
unique  expansion  pattern,  which  is  experienced  by  a  moving 
observer.  Arbitrary  observer  motion  can  be  decomposed  into 
translational  and  rotational  components  from  the  2-D  image, 
without  computing  the  structure  of  the  scene.  Figure  1 
shows  the  basic  viewing  geometry  for  a  moving  observer. 
The  vector  representing  the  direction  of  instantaneous  head¬ 
ing  intersects  the  image  plane  at  the  "focus  of  expansion" 
(FOE).  In  a  stationary  environment  every  point  in  the  image 
seems  to  expand  from  the  FOE.  The  closer  a  point  is  in  3-D, 
the more rapidly it expands in the image. Thus for a stationary
scene, its three-dimensional structure can be obtained
from the expansion pattern. Some obvious cases of object
motion  are  easy  to  detect,  such  as  motion  towards  the  FOE, 
whereas  other  types  of  motion  are  not  so  obvious. 


Figure 1. The basic viewing geometry for a moving observer [15] (left).
The Focus of Expansion marks the direction of instantaneous heading.
Stationary points in the scene seem to expand from one point (the FOE)
when the camera moves along a straight line [17] (right). The rate of
expansion indicates how close to the projection plane a point is in 3-D.


Our  approach  builds  on  the  second  method,  using  the 
expanding  field  of  displacement  vectors  as  the  main  source 
of  information.  Following  a  generate-and-test  strategy, 
hypotheses  about  the  structure  of  the  environment  are 
created  from  relationships  in  the  image,  which  are  checked 
for inconsistencies over time. Instead of creating a numerical
description, the interpretation of the scene is accomplished
in a qualitative fashion [8].

The  approach  we  are  presenting  is  unique  in  several 
respects.  First,  a  qualitative  line  of  reasoning  and  modeling 
is  pursued,  which  offers  significant  advantages  over  "hard" 
numerical  techniques.  Second,  motion  understanding  is 
attempted  in  a  hypothesize-and-test  environment,  where  mul¬ 
tiple  interpretations  of  the  scene  may  be  active  at  the  same 
time.  In  addition  to  the  two-dimensional  displacement  field, 
we  employ  multiple  sources  of  information,  such  as  spatial 
reasoning  and  semantics  for  the  construction  and  mainte¬ 
nance  of  the  scene  model. 

The  choice  of  a  suitable  spatial  description  of  the 
environment [5] is an important component of motion understanding.
A vehicle-centered model of the scene is constructed
and maintained over time, representing the current
set  of  feasible  interpretations  of  the  scene.  In  contrast  to 
most  previous  approaches,  we  do  not  attempt  an  accurate 
geometric  description  of  the  scene.  Instead  we  propose  a 
Qualitative  Scene  Model,  which  holds  only  a  coarse  qualita¬ 
tive  representation  of  the  three-dimensional  environment.  As 
part  of  this  model,  the  "stationary  world"  is  represented  by  a 
set  of  image  locations,  forming  a  rigid  3-D  configuration 
which  is  believed  to  be  stationary.  All  the  motion-related 
processes  at  the  intermediate  level  use  this  model  as  a  cen¬ 
tral  reference.  The  vehicle’s  self-motion,  for  instance,  must 
be  related  to  the  stationary  parts  of  the  environment,  even  if 
large parts of the image are in motion. We argue that this
kind of qualitative modeling and reasoning is sufficient and
efficient for the problem at hand.

While  most  previous  work  can  be  related  to  low-level 
computer  vision,  the  integration  of  motion  into  higher  levels, 
i.e. actual motion understanding, is still in its infancy. Here
we  present  a  set  of  motion-related  processes  allocated  at  an 
intermediate  level  of  computer  vision.  These  processes 
bridge  the  representational  gap  between  purely  two- 
dimensional,  image-centered  low-level  processes  and  the 
three-dimensional,  world-centered  high-level  processes. 
High-level  processes  keep  a  global  view  of  the  scene, 
characterized  by  goal-directed  activities  (focus  of  attention). 
The  long-term  interactions  between  objects  are  handled  at 
this  level.  The  intermediate  level  performs  "routine"  tasks, 
which  do  not  require  focused  attention,  but  short-term 
memory  and  knowledge.  As  far  as  motion  is  concerned,  we 
want  a  vision  system  to  be  "surprised"  only  in  cases  of 
truly unexpected changes in the image. Changes due to
normal self-motion, short-term occlusion etc. should not
invoke attentive action at the high level.


2.  UNDERSTANDING  MOTION 

PROBLEM  STATEMENT 

The  purpose  of  our  approach  is  to  construct  and  main¬ 
tain  a  consistent  interpretation  of  time-varying  images 
obtained from the camera on a moving vehicle. Specifically,
we want to determine:

•  How is the ALV moving?

•  What is moving in the scene, and how does it move?

•  What is the approximate 3-D structure of the scene?

The  original  input  is  a  sequence  of  images,  taken  at  a  con¬ 
stant  rate.  A  low-level  correspondence  technique  is  avail¬ 
able,  which  extracts  distinct  features  (points,  lines  or 
regions)  from  every  image  and  supplies  the  2-D  displace¬ 
ment  between  successive  frames  for  every  feature.  This  is 
the actual input to our vision process. Recently a promising
technique [1] for estimating the displacement field around
region boundaries was devised, which makes this class of
features attractive for our approach. At the present stage
points are the only type of features being used. We assume
that the problem of computing reliable displacement vectors
for individual points (the "Correspondence Problem") [20] is
solved. The correspondence algorithm labels each feature
and tracks it over time by generating a list of tuples:

(feature-label, t0, x0, y0)

(feature-label, t0+1, x1, y1)

...

(feature-label, t0+k, xk, yk)

When no correspondence is found, either because of occlusion
or because the feature left the field of view, the list of
tuples for this feature is discontinued. The low-level process
is not expected to track a feature through occlusion; this is
the task of intermediate-level processes, where short-term
memory is available.
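
The tuple lists described above can be sketched as a small data structure. This is only an illustration: the function names `build_tracks` and `displacements` and the sample coordinates are hypothetical and do not come from the DRIVE implementation.

```python
from collections import defaultdict


def build_tracks(tuples):
    """Group (feature-label, t, x, y) tuples into per-feature tracks sorted by time."""
    tracks = defaultdict(list)
    for label, t, x, y in tuples:
        tracks[label].append((t, x, y))
    for samples in tracks.values():
        samples.sort()
    return dict(tracks)


def displacements(track):
    """Return the 2-D displacement vectors between successive samples of one track."""
    return [(x1 - x0, y1 - y0)
            for (_, x0, y0), (_, x1, y1) in zip(track, track[1:])]


# Hypothetical output of the low-level correspondence process.
observations = [("f1", 0, 10, 10), ("f1", 1, 12, 11), ("f2", 0, 5, 5)]
tracks = build_tracks(observations)
```

A track with a single sample, such as `f2` here, yields no displacement vector; a feature that reappears after occlusion would show up under a new label, as the text explains.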




DETERMINATION  OF  VEHICLE  MOTION 

The  first  step  of  our  approach  is  to  determine  the 
vehicle’s  motion  relative  to  the  stationary  environment 
between  each  pair  of  frames.  If  the  vehicle  moves  along  a 
straight  line,  all  stationary  features  seem  to  expand  from  one 
single  point  in  the  image,  the  focus  of  expansion.  In  the 
general  case,  vehicle  rotation  induces  an  additional  vector 
field  in  the  image.  The  fact  that  all  the  displacement  vectors 
of  stationary  features  must  intersect  in  a  common  point  can 
be used to "derotate" the image and compute the vehicle’s
rotation  and  direction  of  translation.  Due  to  inertia,  the 
direction  of  translation  as  well  as  the  amount  of  rotation 
about  either  axis  cannot  change  drastically  between  two 
frames.  Thus  the  previous  motion  parameters  can  serve  as  a 
good  guess  for  finding  the  current  optimum. 
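
For the purely translational case (an already derotated displacement field), the common intersection point can be estimated as in the following sketch. It uses an ordinary least-squares fit, which is an assumption made for illustration and not necessarily the optimization used by the authors; the name `estimate_foe` is hypothetical, and rotation is not handled.

```python
def estimate_foe(points, displacements):
    """Least-squares intersection of the lines through each point along its displacement.

    Each displacement (dx, dy) at image point (x, y) defines a line; the FOE
    estimate minimizes the summed squared perpendicular offsets to all lines
    (weighted by displacement length).
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), (dx, dy) in zip(points, displacements):
        nx, ny = -dy, dx              # normal of the displacement line
        c = nx * x + ny * y           # line equation: n . f = c
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("displacement vectors are (nearly) parallel")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Synthetic field expanding from (3, 4).
pts = [(5.0, 4.0), (3.0, 8.0), (6.0, 8.0)]
vecs = [(1.0, 0.0), (0.0, 2.0), (0.3, 0.4)]
foe = estimate_foe(pts, vecs)
```

In practice the previous frame's estimate could seed a search over a small region instead, which matches the paper's preference for a region of possible FOE locations over a single point.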

We  do  not  assume  that  the  exact  location  of  the  FOE  in 
the  image  plane  can  always  be  determined,  but  that  a  region 
can  be  specified,  which  contains  the  FOE  with  high  cer¬ 
tainty.  However,  the  smaller  the  region  of  possible  FOE- 
locations  is,  the  more  conclusions  can  be  drawn  for  the 
interpretation  of  the  scene.  Although  this  first  step  is  done 
exclusively  in  the  image  plane,  knowledge  about  the  station¬ 
ary  features  to  be  used  must  be  available  in  some  form. 
This  information  is  supplied  by  the  Qualitative  Scene  Model, 
which  is  described  in  the  following  section. 

Given the accurate location of the FOE $(x_f, y_f)$, the relative
range of any (stationary) point P in the image can be
determined  at  time  t  by  the  relation 

$$\frac{Z(t)}{V(t)} = \frac{r(t)}{v(t)},$$

where Z(t) is the actual distance of the point from the image
plane in 3-D, V(t) is the velocity dZ/dt of the vehicle perpendicular
to the image plane, r(t) is the 2-D distance between
the image of P and the FOE, and v(t) is the radial velocity dr/dt.
Since V(t) is the same for every stationary point in the scene
and r(t) and v(t) can be measured in the image, depth can be
reconstructed up to a common scale factor.
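
As a minimal sketch of this relation, the ratio r(t)/v(t) can be computed from two successive image positions of a point and a known FOE. The function name `relative_range` and the finite-difference estimate of v(t) are assumptions made for illustration.

```python
import math


def relative_range(foe, p_prev, p_curr, dt=1.0):
    """Return r/v for one point, i.e. its depth Z in units of the vehicle speed V.

    r is the current image distance from the FOE; v is the radial velocity,
    approximated by a finite difference over one frame interval dt.
    """
    r_prev = math.hypot(p_prev[0] - foe[0], p_prev[1] - foe[1])
    r_curr = math.hypot(p_curr[0] - foe[0], p_curr[1] - foe[1])
    v = (r_curr - r_prev) / dt
    if v < 0:
        raise ValueError("point moves toward the FOE; it cannot be stationary")
    if v == 0:
        return math.inf  # no expansion: interpreted as infinitely far away
    return r_curr / v


near = relative_range((0.0, 0.0), (10.0, 0.0), (12.0, 0.0))
far = relative_range((0.0, 0.0), (10.0, 0.0), (11.0, 0.0))
```

Only the ordering of such values is meaningful without knowing V(t): here `near < far`, so the first point would be the closer one.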

While  the  vehicle  is  traversing  the  environment,  Z(t) 
keeps  changing  for  every  feature  in  the  field  of  view.  If  the 
depth map itself were used as the scene model, the model
would  have  to  be  updated  continuously,  since  the  depth  of 
objects  is  changing  as  the  ALV  moves  forward.  The  topol¬ 
ogy  of  a  stationary  scene,  however,  remains  unchanged.  The 
key  idea  of  our  approach  is  to  construct  and  maintain  a 
topological  model  of  the  scene  from  observations  in  the 
time-varying images. If, as an example, for two 3-D
features A and B with image points a and b

$$\frac{r_a(t)}{v_a(t)} < \frac{r_b(t)}{v_b(t)}$$

for  every  possible  location  of  the  FOE,  then  we  may  con¬ 
clude  that  A  is  actually  closer  to  the  image  plane  than  B.  If 
A  and  B  are  really  stationary,  this  relationship  must  hold  in 
any  future  observation.  Otherwise  it  can  be  concluded  that  at 
least  one  of  the  two  features  is  not  stationary. 


DETECTING  3-D  MOTION 

A  central  issue  of  motion  understanding  is  the  detection 
of  actual  movements  in  the  3-D  environment.  Given  the 
location  of  the  FOE  and  the  derotated  displacement  field  (as 
explained  earlier),  some  forms  of  3-D  motion  are  easy  to 
detect,  whereas  others  are  of  a  more  subtle  nature.  For 
instance, motion of an image point towards the FOE is
striking evidence of 3-D motion in the scene. In this particular
case, something is obviously moving into the ALV’s trajectory.
This is, however, not the most "dangerous" case.
Consider  the  situation  in  Figure  2,  seen  from  the  ALV  while 
it  is  approaching  an  intersection.  The  shaded  area  in  the 
center  represents  the  region  of  possible  locations  of  the  FOE. 
The  building  on  the  right  seems  to  expand  away  from  the 
FOE,  while  the  truck  (which  is  actually  moving)  stays  at  a 
constant  image  location. 


Figure  2.  A  typical  situation:  the  ALV  is  approaching  an  intersection. 
Two  feature  points  are  tracked  in  the  image,  point  A  on  the  (moving) 
truck,  and  point  B  on  the  building.  The  stationary  part  of  the  image 
seems to expand, while the truck stays at a constant image location.
The  shaded  area  in  the  center  represents  the  region  of  possible  FOE- 
locations. 

From  the  expansion  pattern  alone,  the  feature  point  A  on  the 
truck  will  be  interpreted  as  stationary  at  infinite  distance. 
However,  if  the  truck  and  the  ALV  keep  moving  in  this 
manner  (with  the  image  of  the  truck  staying  at  the  same 
location),  the  two  will  inevitably  collide.  Although  in  reality 
the  truck  is  unlikely  to  show  no  image  motion  at  all,  the 
detection  of  this  threat  in  the  presence  of  noise  is  not  trivial. 
It  is  only  possible,  when  changes  in  the  image  can  be  related 
to  the  spatial  layout  of  the  scene.  This  allows  reasoning  in  a 
fashion  like: 

"Since  the  truck  is  occluding  the  building,  it  must  be 
closer  to  the  ALV  than  the  building.  The  image  of  the 
building  is  expanding,  thus  it  is  at  a  finite  distance.  There¬ 
fore  the  truck  cannot  be  at  infinite  distance  (the  original 
interpretation  was  wrong).  Probably  the  truck  is  moving 
towards  the  ALV." 

From  knowledge  about  the  spatial  relationships  between 
features  in  3-D  and  their  motion  in  the  image,  we  can  make 
conclusions  about  actual  motion  in  the  3-D  environment. 
However,  different  sources  of  knowledge  and  different  paths 
of  reasoning  must  be  combined  to  come  up  with  unambigu¬ 
ous  interpretations. 


OCCLUSION  ANALYSIS 

Whenever  new  image  features  appear  in  the  field  of 
view,  they  must  be  categorized  and  integrated  into  the 
current interpretation(s) of the scene. Thus the appearance
of  a  feature  constitutes  an  event  of  major  activity  for  a 
motion  analysis  system.  The  same  can  be  said  for  the  disap¬ 
pearance  of  an  image  feature.  The  appearance  of  a  feature 
can  happen,  when 

•  the  feature  had  not  been  in  the  field-of-view  before, 

•  the  feature  had  been  occluded,  or 

•  the  feature  is  moving. 

Every  feature  must  be  taken  care  of  at  least  once,  when 
it  becomes  visible  for  the  first  time.  If  no  global  "world 
model"  is  available,  this  class  of  events  cannot  be  predicted 
by  any  means.  In  contrast,  given  an  appropriate  model  of 
the  scene,  the  disappearance  of  a  feature  might  well  be 
predicted.  This  is  not  only  important  for  understanding  the 
event  of  disappearance  itself,  but  it  provides  additional  infor¬ 
mation  about  the  consistency  of  the  model.  If,  for  instance, 
a  stationary  feature  A  does  not  occlude  another  feature  B 
within  the  predicted  period  of  time,  either  the  spatial  rela¬ 
tionships  assumed  by  the  model  are  incorrect,  or  one  of  the 
two  features  is  moving.  In  either  case  the  model  requires  an 
update. 

If  the  ALV  is  moving  in  an  environment  containing  a 
large number of objects at different depths, temporary occlusion
of features will happen frequently. Whenever a feature
becomes  invisible,  the  low-level  correspondence  process  will 
stop  tracking  this  feature  and  discontinue  the  associated  list 
of position tuples (see above). This happens even if the
feature is invisible for only one frame period, since the low-level
process does not estimate the image trajectory of a
feature. When the feature finally reappears out of a temporary
occlusion, the low-level process assigns a new
feature-label to it and resumes tracking it in the image just
like a feature which has never been encountered before. The
intermediate-level  motion  process,  however,  has  most  of  the 
information  available  to  predict  the  reappearance  of  a  feature 
from  temporary  occlusion.  This  not  only  reduces  the 
amount  of  "surprise"  which  must  be  handled  in  such  an 
event,  but  can  also  direct  the  low-level  process  to  look  out 
for a new feature in a certain image region. Since this work is
part of a larger project for ALV-related vision, the integration
of map information for occlusion analysis and tracking
is among the final goals.

RULES 

There  are  various  independent  sources  that  may  contri¬ 
bute  to  the  creation  of  scene  interpretations:  geometry,  spa¬ 
tial  reasoning,  and  semantics.  Each  of  these  three  types  of 
knowledge  is  cast  into  a  set  of  rules,  which  is  applied  to 
time-varying  images. 

Conclusions from the expansion pattern described
above are based only on the geometry of the imaging process
and the vehicle motion. The following set of geometric
rules should illustrate the idea.


Rule GEOMETRY-1: This simple rule can be applied after
the image has been derotated and the region of possible FOE
locations has been determined.
if

(point a is moving toward the FOE region)

then

hypothesize (mobile a).
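
One possible reading of "moving toward the FOE region" is sketched below. The region is represented by a list of sampled candidate FOE locations, and a point counts as moving toward the region only if it approaches every candidate; both choices are assumptions for illustration, as are the function names.

```python
import math


def moving_toward_foe_region(p_prev, p_curr, foe_candidates):
    """True if the point gets closer to every candidate FOE location."""
    return bool(foe_candidates) and all(
        math.hypot(p_curr[0] - fx, p_curr[1] - fy)
        < math.hypot(p_prev[0] - fx, p_prev[1] - fy)
        for fx, fy in foe_candidates
    )


def geometry_1(label, p_prev, p_curr, foe_candidates):
    """Return the hypothesis (mobile label) if the rule fires, else None."""
    if moving_toward_foe_region(p_prev, p_curr, foe_candidates):
        return ("mobile", label)
    return None


# Hypothetical FOE region sampled at the corners of a unit square.
region = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fired = geometry_1("a", (10.0, 10.0), (9.0, 9.0), region)
silent = geometry_1("b", (10.0, 10.0), (11.0, 11.0), region)
```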

Rule GEOMETRY-2: If the image has been derotated and the
region of possible FOE locations is known, then for every
point a the minimum and maximum distance $r_a^{\min}$, $r_a^{\max}$ and
velocity $v_a^{\min}$, $v_a^{\max}$ relative to the FOE can be computed.
if 

( stationary  a) 

( stationary  b) 

( $r_a^{\max} / v_a^{\min} < r_b^{\min} / v_b^{\max}$ )

then 

hypothesize  (closer  a  b). 
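
A conservative way to test the requirement "for every possible location of the FOE" is to bound r and v over the FOE region and compare the worst cases. The sketch below assumes the region is given as sampled candidate locations and that v is a one-frame finite difference; the interval test itself is an illustrative reading of the rule, not a confirmed part of the original system.

```python
import math


def radial_bounds(p_prev, p_curr, foe_candidates):
    """Return (r_min, r_max, v_min, v_max) of a point over all candidate FOEs."""
    rs, vs = [], []
    for fx, fy in foe_candidates:
        r_prev = math.hypot(p_prev[0] - fx, p_prev[1] - fy)
        r_curr = math.hypot(p_curr[0] - fx, p_curr[1] - fy)
        rs.append(r_curr)
        vs.append(r_curr - r_prev)
    return min(rs), max(rs), min(vs), max(vs)


def geometry_2(a, b, foe_candidates):
    """True if a is closer than b for every candidate FOE (conservative test).

    a and b are (p_prev, p_curr) pairs of image positions of stationary points.
    """
    _, ra_max, va_min, _ = radial_bounds(a[0], a[1], foe_candidates)
    rb_min, _, vb_min, vb_max = radial_bounds(b[0], b[1], foe_candidates)
    if va_min <= 0 or vb_min <= 0:
        return False  # not clearly expanding for every candidate FOE
    return ra_max / va_min < rb_min / vb_max


region = [(0.0, 0.0), (1.0, 1.0)]
pt_a = ((10.0, 0.0), (14.0, 0.0))   # expands quickly
pt_b = ((20.0, 0.0), (21.0, 0.0))   # expands slowly
```

The smaller the FOE region, the tighter the bounds and the more pairs can be ordered, which is the point made earlier about the size of the region.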

Since  the  relative  distance  between  image  points  is  not 
affected  by  small  amounts  of  vehicle  rotation,  the  image 
need not be derotated for the following rule. Only a very
coarse  estimate  of  the  FOE’s  location  must  be  available, 
which  may  be  obtained  from  previous  frames. 

Rule  GEOMETRY-3: 

An image point a is said to be inside a point b if a is closer
to the FOE than b. For pairs of points in a certain neighborhood,
this is easy to determine. $D_{ab}(t)$ denotes the Euclidean
distance from a to b at time t.

if

(stationary a)

(stationary b)

(inside a b)

( $D_{ab}(t) > D_{ab}(t+1)$ )

then 

hypothesize  (closer  a  b). 

The following constraint rule verifies that a hypothesis generated
by the previous rule can be maintained over time.
Constant vehicle motion is implicitly assumed.

Rule GEOMETRY-4: Due to the monotonic functions
involved in the imaging process, motion between two (stationary)
image points a,b must continue once it has started.
if

(stationary a)

(stationary b)

( $D_{ab}(t) > D_{ab}(t+1)$ )

then

verify ( $D_{ab}(t+1) > D_{ab}(t+2)$ ).
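
Rules GEOMETRY-3 and GEOMETRY-4 can be sketched together as follows. Tracks are assumed to be lists of image positions indexed by frame, and only a coarse FOE estimate is needed; the function names and the sample tracks are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two image points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def geometry_3(a_track, b_track, foe, t):
    """Hypothesize (closer a b) if a is inside b and the two points converge."""
    inside = dist(a_track[t], foe) < dist(b_track[t], foe)
    converging = dist(a_track[t], b_track[t]) > dist(a_track[t + 1], b_track[t + 1])
    return ("closer", "a", "b") if inside and converging else None


def geometry_4(a_track, b_track, t):
    """Constraint check: convergence that started at t must continue at t+1.

    Returns True (verified), False (violated) or None (rule does not apply).
    """
    d0 = dist(a_track[t], b_track[t])
    d1 = dist(a_track[t + 1], b_track[t + 1])
    d2 = dist(a_track[t + 2], b_track[t + 2])
    if d0 > d1:
        return d1 > d2
    return None


a_pts = [(10.0, 0.0), (13.0, 0.0), (17.0, 0.0)]
b_pts = [(20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
b_bad = [(20.0, 0.0), (21.0, 0.0), (30.0, 0.0)]
```

A `False` result from `geometry_4` would indicate that at least one of the two points is not stationary, which is the kind of conflict the scene model has to resolve.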

The  results  from  geometry  are  valid  for  any  arbitrary 
configuration  of  points  in  space.  However,  certain  assump¬ 
tions  can  be  made  about  the  spatial  layout  of  the  scene 
which  will  be  encountered.  For  instance,  the  vehicle  always 
travels on some kind of surface, such that the profile of the
view  is  more  or  less  convex  (Fig.  3). 

If  the  camera  is  fixed  to  the  vehicle,  such  that  the  wheels 
and  the  ground  are  below,  we  can  define  a  (heuristic)  rule, 
that  features  which  are  lower  in  the  image  are  generally 
closer  to  the  vehicle. 




Figure  3.  The  convex  nature  of  the  viewing  profile.  It  allows  the  heuris¬ 
tic  assumption,  that  features  lower  in  the  image  are  generally  closer  to 
the  vehicle. 


Rule SPATIAL-1: For any pair of image points a,b:

if 

( lower  a  b  ) 

then 

hypothesize  (  closer  a  b  ) 

As a consequence, it is very unlikely that a feature is farther
away  than  all  its  surrounding  neighbors. 

Occlusion  is  another  important  source  of  spatial  infor¬ 
mation,  which  is  applicable  for  more  complex  features  such 
as  lines  and  regions.  A  feature  occluding  another  feature  is 
closer  to  the  viewer  than  the  occluded  feature. 

Rule  OCCLUSION-1:  For  any  pair  of  image  features  (lines, 
regions)  a,b  : 

if 

(  occluding  a  b  ) 

then 

hypothesize  (  closer  a  b  ). 

Semantics  can  supply  useful  clues,  when  partial 
interpretations  of  the  scene  are  already  available.  If,  for 
instance,  the  horizon  has  been  identified,  any  object  above  it 
must  be  in  the  sky  and  cannot  be  stationary.  Similarly,  the 
features  of  an  object  recognized  as  a  building  would  not  be 
considered  moving  in  an  ambiguous  situation. 

All  these  three  sources  of  information  (geometry,  spa¬ 
tial  reasoning  and  semantics)  contribute  to  the  creation  or 
the  elimination  of  interpretations  in  the  Qualitative  Scene 
Model.  The  set  of  rules  given  above  should  demonstrate  the 
idea  and  is  far  from  being  complete.  Since  their  underlying 
assumptions  vary  considerably,  conclusions  from  different 
groups  of  rules  carry  different  weights.  Conclusions  from 
geometry, which are based on the most general assumptions,
always overrule results from spatial reasoning or semantics.
Meta-knowledge  of  this  form  is  used  to  organize  the  interac¬ 
tion  of  the  different  sets  of  rules. 

THE  QUALITATIVE  SCENE  MODEL 

Since  the  proposed  model  is  vehicle-centered,  most 
information  contained  in  the  image  is  implicitly  part  of  the 
model.  In  addition  to  these  facts,  the  other  part  of  the 
model  forms  the  actual  interpretation  of  the  scene.  The 
feature-property stationary/mobile and the closer relationship
between features are the central building blocks of the Qualitative
Scene Model.

The  current  set  of  image  features  together  with  assigned 
properties  and  mutual  relationships  constitute  an 
interpretation  of  the  scene.  An  interpretation  is  feasible,  as 
long  as  it  is  free  of  internal  conflicts,  e.g.  ( closer  A  B)  and 
(closer  B  A).  The  Qualitative  Scene  Model  comprises  a  set 
of  feasible  interpretations  of  the  current  scene.  As  the  vehi¬ 
cle  proceeds,  more  information  about  the  structure  of  the 
scene  is  accumulated. 

Figure 4 shows an example of how a network of closer-relationships
is constructed in the case of a completely stationary
set of features a,b,c,d. Initially nothing is known
about  their  spatial  relationships,  except  that  (since  they  are 
visible)  all  are  in  front  of  the  image  plane.  This  situation  is 
reflected  by  the  graph  and  the  set  of  facts  in  Fig.  4a.  Since 
facts  of  the  form  (closer  O  a)  are  implicit,  they  are  only 
listed  for  the  initial  case.  As  more  findings  are  added  to  the 
list  of  facts,  the  tree  of  closer-relationships  keeps  growing, 
as  shown  in  Fig.  4b-d. 


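[Figure 4 graphic: panels (a)-(d), each showing the graph of closer-relationships
next to its list of facts. Panel (a) lists (closer O a), (closer O b), (closer O c),
(closer O d); the later panels add (closer a b), (closer b c), (closer a c) and
(closer d c).]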

Figure 4. Successive construction of closer relationships. (a) Initial
situation: no relative depth is known for the image features a,b,c,d. They
are only known to be in front of the image plane O. (b) (closer a b) has
been determined and is added to the list of facts. (c) (closer b c) has
been determined, (closer a c) is implied by transitivity. (d) (closer d c)
has been determined; notice that at this point nothing can be said about
the relationship between (a,d) and (b,d).
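
The construction in Figure 4 can be sketched as a directed graph in which transitivity is answered by reachability and a conflict is an attempt to assert the reverse of a known relationship. The class name `CloserGraph` and its methods are hypothetical and only illustrate the idea.

```python
class CloserGraph:
    """Directed graph of (closer near far) facts between features."""

    def __init__(self):
        self.edges = {}  # feature -> set of features it is directly closer than

    def implies(self, near, far):
        """True if (closer near far) is a stored fact or follows by transitivity."""
        stack, seen = [near], set()
        while stack:
            node = stack.pop()
            for nxt in self.edges.get(node, ()):
                if nxt == far:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def add(self, near, far):
        """Assert (closer near far); return False if it contradicts known facts."""
        if near == far or self.implies(far, near):
            return False
        self.edges.setdefault(near, set()).add(far)
        self.edges.setdefault(far, set())
        return True


g = CloserGraph()
g.add("a", "b")   # Fig. 4b
g.add("b", "c")   # Fig. 4c
g.add("d", "c")   # Fig. 4d
```

As in panel (d), nothing can be derived about the pairs (a,d) and (b,d), while (closer a c) follows by transitivity.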


3.  INTERPRETATION  AND  CONFLICT  RESOLUTION 

Using  a  set  of  rules  which  include  those  described  in 
the  previous  section,  the  Qualitative  Scene  Model  is  con¬ 
structed.  In  this  section  we  want  to  describe  how  the  model 
develops  over  time,  how  new  interpretations  of  the  scene  are 
generated,  and  how  conflicts  are  resolved. 




Let  us  consider  a  simple  example  with  a  scene  contain¬ 
ing  only  three  feature  points  a,b,c  (see  Figure  5). 

Initially (at t = t0) nothing is known about the spatial relationships
between these points and whether they are stationary
or  not.  The  default  assumption  is  that  any  point  is  station¬ 
ary,  unless  there  is  an  indication  that  this  is  not  true.  The 
initial  interpretation  of  the  scene  thus  contains  only 

Interpretation A(t0):

(stationary a)

(stationary b)

(stationary c)

Suppose that between t0 and t1 all three points show some
amount of expansion away from the FOE, giving rise to the
conclusion (e.g. by rule GEOMETRY-2) that a is closer (to the
vehicle) than b, a is closer than c, and c is closer than b.
From the information gathered up to this point, the interpretation
of the scene at time t1 looks like this:

Interpretation A(t1):

(stationary  a) 

(stationary  b) 

(stationary  c) 

(closer a b)

(closer  a  c) 

(closer  c  b) 

At time t2 one of the rules claims that c is closer than a and
tries to assert this fact into the current interpretation.
Clearly, the new interpretation would contain the conflicting
facts

(closer a c) and (closer c a),

which would not be a feasible interpretation. The conflict is
resolved by creating two alternative hypotheses, with either a
or c as mobile:

Interpretation B(t2):

(mobile  a) 

( stationary  b) 

(stationary  c) 

(closer  c  b) 

Interpretation C(t2):

(stationary  a) 

(stationary  b) 

(mobile  c) 

(closer  a  b) 

Notice,  that  when  a  feature  is  hypothesized  to  be  mobile,  all 
its  closer-relationships  are  removed  from  the  interpretation. 
At  this  point  in  time,  two  feasible  interpretations  of  the 
scene  are  active  simultaneously.  All  active  interpretations  are 
pursued  until  they  enter  a  conflicting  state,  in  which  case 
they  are  either  branched  into  new  interpretations  or  removed 
from  the  Qualitative  Scene  Model. 

In our example we assume that both interpretations B and C
are still alive at time t3. At this point some rule claims that b
should be closer than c. This would not create a conflict in
interpretation C, because there c is considered mobile
anyway. Interpretation B, however, cannot ignore this new
finding, because it contains the contradicting fact (closer c b).
Again we could branch interpretation B into two new
interpretations, with either (mobile b) or (mobile c). This
time, however, there is another active interpretation (C),
which can absorb (closer b c) without causing an internal
conflict. Thus interpretation B is not branched out, but
removed altogether from the model.


Figure 5. Development of a scene model over time: At time t0 three features a,b,c are given, which are initially
assumed to be stationary. At time t1 three closer-relationships have been established between a,b,c. All features are still
considered stationary. At time t2 a conflict arises in interpretation A, which is the only interpretation active at this time.
Two new interpretations (B and C) are created, each containing one feature considered mobile (a and c). At time t3
another conflict occurs in interpretation B. Since an interpretation (C) is active at the same time, which can absorb the
conflicting fact (closer b c), B is collapsed and C remains as the only feasible interpretation.




In  summary,  the  development  of  the  Qualitative  Scene 
Model is controlled by the following meta-rules:

•  Initially  assume  that  all  features  belong  to  stationary 
objects.  There  is  only  one  initial  (root)  interpretation. 

•  After  a  hypothesis  has  been  created  by  one  of  the  analyz¬ 
ing  rules,  try  to  integrate  this  hypothesis  as  a  fact  into  every 
active  interpretation. 

•  If  the  new  fact  is  consistent  with  the  interpretation,  make 
it  a  part  of  this  interpretation. 

•  If the new fact is not consistent with the interpretation,
and there are currently no other active interpretations, then
create a new set of interpretations containing this fact
without conflict.

•  Otherwise  delete  this  interpretation  from  the  model. 
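
The meta-rules can be sketched in a few lines and checked against the three-point example of this section. The representation (an interpretation as a pair of frozensets: mobile features and closer facts) and the function names are assumptions for illustration. The sketch also adopts one reading of the last two meta-rules: conflicting interpretations are branched only when no active interpretation can absorb the new fact, and are deleted otherwise.

```python
def reaches(closer, near, far):
    """True if (closer near far) is stored or follows by transitivity."""
    stack, seen = [near], set()
    while stack:
        node = stack.pop()
        for x, y in closer:
            if x == node:
                if y == far:
                    return True
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return False


def make_mobile(interp, feature):
    """Mark a feature mobile and drop all of its closer-relationships."""
    mobile, closer = interp
    return (mobile | {feature}, frozenset(f for f in closer if feature not in f))


def integrate(model, near, far):
    """Try to add the hypothesis (closer near far) to every active interpretation."""
    kept, conflicting = [], []
    for mobile, closer in model:
        if near in mobile or far in mobile:
            kept.append((mobile, closer))            # fact is irrelevant here
        elif reaches(closer, far, near):
            conflicting.append((mobile, closer))     # would contradict a known fact
        else:
            kept.append((mobile, closer | {(near, far)}))
    if kept:
        return kept                                  # conflicting ones are deleted
    branched = []
    for interp in conflicting:                       # no interpretation absorbed it
        for feature in (near, far):
            candidate = make_mobile(interp, feature)
            if candidate not in branched:
                branched.append(candidate)
    return branched


model = [(frozenset(), frozenset())]                 # root: everything stationary
for fact in [("a", "b"), ("a", "c"), ("c", "b")]:    # interpretation A at t1
    model = integrate(model, *fact)
model_t2 = integrate(model, "c", "a")                # conflict: branch into B and C
model_t3 = integrate(model_t2, "b", "c")             # B conflicts and is removed
```

After the last step only the interpretation with c mobile and the single fact (closer a b) remains, matching interpretation C in Figure 5.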

The  algorithm  for  finding  the  FOE  and  derotating  the 
image  requires  a  set  of  image  points,  which  are  believed  to 
be  stationary.  Out  of  all  interpretations  active  at  the  same 
time,  the  most  conservative  interpretation  (i.e.  with  the  smal¬ 
lest  set  of  stationary  features)  is  selected  for  this  purpose. 
Also,  the  information  contained  in  the  Qualitative  Scene 
Model  is  made  available  to  processes  such  as  vehicle  con¬ 
trol,  navigation,  object  recognition,  motion  description  and 
threat  handling.  The  main  role  of  the  Qualitative  Scene 
Model  is  to  serve  as  a  representational  bridge  between  low- 
and  high-level  vision  and  to  provide  a  stable  reference  over 
time. 
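
Selecting the most conservative interpretation as the reference for FOE computation can be sketched as follows, assuming each interpretation is summarized by its set of mobile features; the function name `reference_features` is hypothetical.

```python
def reference_features(features, interpretations):
    """Return the smallest set of stationary features over all active interpretations.

    features: all currently tracked feature labels.
    interpretations: one set of mobile feature labels per active interpretation.
    """
    stationary_sets = [set(features) - set(mobile) for mobile in interpretations]
    return min(stationary_sets, key=len)


active = [set(), {"c"}]  # e.g. one interpretation with c mobile, one without
reference = reference_features({"a", "b", "c"}, active)
```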


4.  CONCLUSIONS 

The main difficulty of motion understanding in the
ALV scenario is that the effects of vehicle motion and actual
movement of objects in the scene both contribute to a constantly
changing camera image. In this paper we presented a new
approach to this problem which departs clearly from previous
work. Avoiding a purely numerical technique, we advocate
a qualitative line of reasoning and modeling. Multiple
interpretations of the dynamics in the scene are pursued at
the same time, allowing a very flexible way of reasoning.
No attempt is made to reconstruct the three-dimensional
structure of the scene in numerical quantities. Instead, scene
description, motion detection, and occlusion analysis are
accomplished by reasoning based on a Qualitative Scene
Model. Different sources of visual information, such as
motion, spatial reasoning, and semantics, are integrated in a
rule-based framework. We have tried to show that this
qualitative strategy of reasoning and modeling can provide
significant advantages over previous, purely numerical
methods.

The work reported here shows the conceptual outline of
our approach. While we have considered only point-features
so far, the integration of lines and regions will be a natural
extension. An implementation on a Symbolics 3670 is
currently under way, using real ALV imagery to demonstrate
the benefits of our approach.


REFERENCES 

1.  B.  Bhanu  and  W.  Burger,  “Approximation  of  Displace¬ 
ment  Fields  Using  Wavefront  Region  Growing,” 
University  of  Utah,  Dept,  of  Computer  Science  UUCS 
86-111  (June  1986). 

2.  S.  Bharwani,  E.  Riseman,  and  A.  Hanson,  “Refinement 
Of  Environmental  Depth  Maps  Over  Multiple  Frames,” 
Proc.  IEEE  Workshop  on  Motion,  Kiawah  Island  Resort 
(May  1986). 

3.  T.  J.  Broida  and  R.  Chellapa,  “Kinematics  and  Struc¬ 
ture  of  a  Rigid  Object  from  a  Sequence  of  Noisy 
Images,”  Proc.  IEEE  Workshop  on  Motion,  Kiawah 
Island  Resort  (May  1986). 

4.  J.  Q.  Fang  and  T.  S.  Huang,  “Some  Experiments  on 
Estimatiing  the  3-D  Motion  Parameters  of  a  Rigid  Body 

from  Two  Consecutive  Image  Frames,”  IEEE  Trans. 
Pattern  Analysis  and  Machine  Intelligence  PAMI- 
6(5)  pp.  545-554  (September  1984). 

5.  G.  Hagert,  “What’s  in  a  mental  model?  On  conceptual 
models  in  reasoning  with  spatial  descriptions,”  Proc. 
9th  International  Joint  Conference  on  Artificial  Intelli¬ 
gence,  Morgan  Kaufmann  Publishers  (1985). 

6.  E.  C.  Hildreth  and  N.  M.  Grzywacz,  “The  Incremental 
Recovery  of  Structure  from  Motion:  Position  vs.  Velo¬ 
city  Based  Formulations,”  Proc.  IEEE  Workshop  on 
Motion,  Kiawah  Island  Resort  (May  1986). 

7.  R.  Jain,  S.  L.  Bartlett,  and  N.  O’Brien,  “Motion  Stereo 
Using  Ego-Motion  Logarithmic  Mapping,”  Proc.  IEEE 
Conf.  Computer  Vision  and  Pattern  Recognition  (1986). 

8.  B.  Kuipers,  “Qualitative  Simulation,”  Artificial  Intelli¬ 
gence  29  pp.  289-338  (1986). 

9.  D.  N.  Lee,  “The  optic  flow  field:  the  foundation  of 
vision,”  Phil.  Trans.  R.  Soc.  Lond.  B  290  pp.  169-179 
(1980). 

10.  H.  C.  Longuet-Higgins  and  K.  Prazdny,  “The  interpre¬ 
tation  of  a  moving  retinal  image,”  Proc.  R.  Soc.  Lond. 
B  208  pp.  385-397  (1980). 

11.  A.  Z.  Meiri,  “On  Monocular  Perception  of  3-D  Moving 
Objects,”  IEEE  Trans.  Pattern  Analysis  and  Machine 
Intelligence  PAMI-2(6)  pp.  582-583  (November  1980). 

12.  A.  Mitiche,  S.  Seida,  and  J.  K.  Aggarwal,  “Determin¬ 
ing  Position  and  Displacement  in  Space  from  Images,” 
Proc.  IEEE  Conf.  Computer  Vision  and  Pattern  Recog¬ 
nition  (June  1985). 

13.  H.-H.  Nagel,  “Image  Sequences  -  Ten  (octal)  Years  - 
From  Phenomenology  towards  a  Theoretical  Founda¬ 
tion,”  Proc.  Intern.  Conf.  on  Pattern  Recognition,  Paris 
(1986). 

14.  K.  Prazdny,  “Determining  the  Instantaneous  Direction 
of  Motion  from  Optical  Flow  Generated  by  a  Curvi¬ 
linear  Moving  Observer,”  Computer  Graphics  and 
Image  Processing  17  pp.  238-248  (1981). 


587 


15.  K.  Prazdny,  “On  the  Information  in  Optical  Flows,” 
Computer  Vision,  Graphics,  and  Image  Processing 
22  pp.  239-259  (1983). 

16.  D.  Regan,  K.  Beverly,  and  M.  Cynader,  “The  Visual 
Perception  of  Motion  in  Depth,”  Scientific  American, 

pp.  136-151  (July  1979). 

17.  J.  H.  Rieger,  “Information  in  optical  flows  induced  by 
curved  paths  of  observation,”  J.  Opt.  Soc.  Am 
73(3)  pp.  339-344  (March  1983). 

18.  H.  Shariat  and  K.  E.  Price,  “How  to  Use  More  Than 
Two  Frames  to  Estimate  Motion,”  Proc.  IEEE 
Workshop  on  Motion,  Kiawah  Island  Resort  (May 
1986). 

19.  R.  Y.  Tsai  and  T.  S.  Huang,  “Uniqueness  and  Estima¬ 
tion  of  Three-Dimensional  Motion  Parameters  of  Rigid 
Objects  with  Curved  Surfaces,”  IEEE  Trans.  Pattern 
Analysis  and  Machine  Intelligence  PAMI-6(1)  pp.  13- 
27  (January  1984). 

20.  S.  Ullman,  The  Interpretation  of  Visual  Motion,  MIT 
Press,  Cambridge,  Mass.  (1979). 

21.  S.  Ullman,  “Maximizing  Rigidity:  The  Incremental 
Recovery  of  3-D  Structure  from  Rigid  and  Rubbery 
Motion,”  MIT  A.I.Memo  No.  721  (June  1983). 

22.  Y.  Yasumoto  and  G.  Medioni,  “Experiments  in  Estima¬ 
tion  of  3-D  Motion  Parameters  from  a  Sequence  of 
Image  Frames,”  Proc.  IEEE  Conf.  Computer  Vision 
and  Pattern  Recognition  (1985). 

23.  X.  Zhuang  and  R.  M.  Haralick,  “Rigid  Body  Motion 
and  the  Optic  Flow  Image,”  IEEE  Conference  on  AI 
Applications  (December  1984). 


588 


What  Is  A  “Degenerate”  View? 


John R. Kender¹
David  G.  Freudenstein 


Department  of  Computer  Science 
Columbia  University 
New  York,  New  York  10027 


1  Abstract 

In this paper, we attempt to quantify what is meant by the
term "degenerate view" and its relatives, "characteristic
view", "visual event", and "general viewing position". We
propose  that  the  definition  of  degeneracy  is  itself  degenerate, 
taking  on  differing  meanings  at  different  times.  We  claim  (at 
least  for  the  case  of  polyhedra)  that  one  can  only  speak  of  a 
two-dimensional  stimulus  as  being  degenerate  with  respect  to 
a  given  heuristic  for  inverting  the  image  function.  Additionally, 
we show that given the finite viewing resolution of a two-
dimensional retina, in practice the concept of a characteristic
view is often not characteristic of real imagery. Even precisely
defined  general  viewing  positions  are  sensitive  to  camera 
acuity:  any  viewpoint  ceases  to  be  characteristic  at  some 
resolution,  and  non-characteristic  views  are  not  vanishingly 
improbable.  We  provide  initial  quantitative  estimates  on  these 
probabilities  for  some  simple  cases,  and  relate  them  to  a 
minimal  disambiguation  distance.  It  follows  that  an  aspect 
graph  is  less  a  discrete  graph,  and  more  properly  a  partitioning 
of  the  surface  of  the  viewing  sphere  into  "fuzzy"  regions  of 
non-zero  area:  an  aspect  map.  This  viewpoint  is  more  in 
keeping  with  recent  and  proposed  work  on  optimal  viewing 
strategies. 


2  Introduction 

Robotic  vision  systems  must  both  obtain  images  and 
analyze  them.  However,  a  primary  characteristic  of  many 
realistic  imaging  situations  is  that  the  data  acquisition  is  much 
less  costly  than  the  subsequent  data  analysis.  In  such 
domains  it  is  therefore  reasonable  to  dedicate  significant 
computational  effort  towards  the  task  of  calculating  an  optimal 
viewing  point  for  the  next  image  capture.  Defining  and 
obtaining  this  optimum  is  necessarily  probabilistic;  it  must 
incorporate  an  understanding  of  the  limits  of  resolution  of  the 
camera,  and  of  the  limits  of  resolution  of  the  placing  agent. 
The  overall  goal  is  to  obtain  maximal  information  from  a 
sequence  of  inexact  images  in  inexact  placements,  while 
minimizing  some  work  function  which  expresses  the  relative 
costs  of  image  acquisition  and  image  analysis.  Such 
calculations  necessarily  place  a  heavy  premium  on  avoiding 
what are often referred to as "degenerate views".


¹ This research was supported in part by ARPA grant
#N00039-84-C-0165, by an NSF Presidential Young
Investigator Award, and by Faculty Development Awards from
AT&T, Ford Motor Co., and Digital Equipment Corporation.


Nevertheless,  it  is  not  apparent  what  makes  a  view 
degenerate,  how  such  a  view  is  recognized  or  forecast,  or 
even  whether  such  views  are  rare  or  commonplace.  Thus,  the 
first  concern  of  this  paper  is  to  define  and  quantify  the 
meaning  of  the  term  "degenerate",  and  to  show  the  varying 
imaging  contexts  in  which  it  can  arise.  Secondly,  we  suggest 
a  representation  useful  for  calculating  the  likelihood  of  such 
views  (whatever  their  definition);  it  takes  the  form  of  mappings 
over  the  viewing  sphere.  This  representation  extends  existing 
work  on  aspect  graphs  by  explicitly  incorporating  the  known 
limits  of  visual  acuity.  It  also  leads  directly  to  methods  of 
associating  with  each  view  a  probability  of  its  being  attained, 
and  the  placement  cost  of  attaining  such  a  view. 

Therefore,  as  a  first  step  to  a  general  theory  of  sensing 
strategies  for  finite  resolution  imagery,  we  use  these  aspect 
maps  to  quantify  degeneracy.  Although  this  paper  is  limited  to 
an  analysis  of  the  issue  from  the  perspective  of  two- 
dimensional  (areal)  imagery,  it  is  clear  that  related  problems 
occur  with  other  types  of  sensors,  whether  they  are  point-, 
line-,  or  area-based,  and  whether  they  sample  distance, 
orientation,  reflectance,  motion,  or  combinations  of  them. 


3  Compact  and  Partial  Survey  of  Existing 
Approaches 

How  to  model  the  viewpoints  of  a  three-dimensional  object 
is  related  to  the  general  question  of  three-dimensional 
modeling [Requicha 80]. Most schemes have traditionally
been  classified  as  either  "Boundary  Representations"  (which 
represent  the  enclosing  surfaces  of  the  object)  or 
"Constructive  Solid  Geometry"  (the  boolean  combination  of 
volumetric  primitives).  [Ballard  and  Brown  82]  classify  as  a 
third  category  those  representations  based  on  sweeping  a 
two-dimensional  region  along  some  specified  space  curve, 
although one might think of these "generalized cylinder"
representations  as  an  extension  of  Constructive  Solid 
Geometry. 

A priori, there will be an infinite number of viewpoints from
which a given object will be visible from a fixed distance. The
"characteristic view" representation attempts to solve this
problem by partitioning the infinite views into a finite set of
viewpoint classes. Each viewpoint class represents all of the
(infinite) various viewpoints from which a specific set of faces
of the object will be visible¹.

Thus, an implicit objective of the "characteristic view"
representation of a three-dimensional object is not only to
encode the object's three-dimensional structure using a type of
boundary representation, but explicitly to represent the
possible equivalence classes of images obtained by a camera
at any of the infinite possible viewpoints. Such information is
rich in the sense that it relates directly to the procedural
information necessary to instantiate models and to determine
future viewpoints.

One of the more successful three-dimensional vision
systems to date is the ACRONYM system [Brooks 81], which
combines information from all levels of computer vision.
Binford describes the system as "generic with respect to
observation, that is, ACRONYM is insensitive to viewpoint and
flexible with varied sensor inputs." ([Binford 82], p. 38).
However, it is worth noting that the success of ACRONYM has
been achieved without the need to compare images obtained
from different viewpoints of the same object (other than stereo
pairs taken from a "characteristic" position: directly above the
basically horizontal objects).

Several schemes for representing the set of possible
"characteristic views" for a given object have been proposed
[Ansaldi, De Floriani, and Falcidieno 85; Callahan and Weiss
85; Wong and Fu 84; Fekete and Davis 84]. Several of these
have dealt explicitly only with smooth objects such as tori.
One important question which has not yet been resolved is
whether the ideal representation should be general enough to
cover both smooth objects and polyhedra, or whether it is
more efficient to use a different scheme for each. Of course, in
the latter case, it remains to be determined how to
integrate such representations in imagery containing both
types of objects simultaneously.

[Callahan  and  Weiss  85]  use  a  graph  embedded  on  the 
viewing  sphere  as  a  model  for  surface  shape.  They  present 
many  useful  examples  of  their  model,  although  they  describe 
only smooth surfaces. Borrowing from [Koenderink and van
Doorn 79], visual "events" are clearly encompassed by the
representation, and analogues to "typical views" for several
objects (bumpy sphere, tori, etc.) are presented. However,
there is no apparent connection between their formalism and
image observables; many of the examples imply
semi-transparent objects.

One  recent  approach  [Fekete  and  Davis  84]  apparently 
works  for  either  polyhedral  or  smooth  objects.  A  new  data 
structure ("hierarchic trixels") for representing a function on
the  viewing  sphere  is  presented,  which  explicitly  encodes  a 
discrete  sampling  of  the  values  of  an  arbitrary  function  of  the 
image,  measured  on  images  generated  from  all  possible 
viewpoints.  Object  recognition  is  performed  based  on  this  set 
of  function  values.  This  resulting  function  on  the  sphere,  the 
"property  sphere",  can  be  used  to  generate  aspect  graphs. 

It is a significant result of this work that the two-dimensional
analog of the property sphere, the "property circle", has
proved sufficient for useful object recognition tasks. Another
noteworthy aspect of the work of [Fekete and Davis 84] is that
the arbitrary function used until now has (intentionally) been a
metric of relatively low information content, e.g. the average
radius of the image from the given viewpoint. This metric
reduces area information in an image to a small number of
scalars (here, one), in effect turning the camera into a point-
based sensor. It remains to be seen whether other intuitively
"higher" information-content metrics will perform better. In
spirit, the property circle work has much in common with the
work of Grimson on sensing strategies for point-based sensors
[Grimson 85], where the concept of trixel has as its analogue
the Chebyshev point of the bounding polygon of a projected
object surface.


¹ A recent application of this concept to 3-D object representation in the
context of bin-picking tasks is reported in [Ikeuchi 86]. In particular, an
interpretation tree is generated which distinguishes among viewpoints based
on the set of visible faces.

The property sphere concept has been extended by [Korn
and Dyer 86], who specify a number of useful algorithms for
constructing and manipulating the property sphere
representation. Specifically, in order to minimize redundant
data, Korn and Dyer have extended the concept of 2-D region
growing [Ballard and Brown 82]. This method saves both
space and time in further operations, by merging "viewpoints
into maximally connected regions such that all viewpoints of a
region have the same value of a given property P" (p. 22, ibid.).
An analysis is presented of a particular implementation of the
property sphere.

The  viewpoint-modeling  approaches  mentioned  so  far  are 
motivated  by  the  ultimate  goal  of  performing  object 
recognition.  However,  similar  viewpoint  models  are  important 
for  another  purpose:  planning  camera  positions.  Planning 
future  viewpoints  will,  of  course,  be  part  of  most  object 
recognition  schemes.  Nonetheless,  it  may  be  viewed  as  a 
distinct  subgoal  of  object  recognition,  and  sometimes  as  a 
totally  independent  task. 

Thus,  in  contrast  to  the  approaches  mentioned  so  far, 
[Canny  84]  does  not  explicitly  model  the  various  viewpoints  of 
a  given  object.  Rather,  his  goal  is  to  characterize  "the  regions 
in space from which a given feature" is either entirely or partly
visible.  Working  in  the  context  of  3-D  mechanical  part 
inspection,  Canny  deals  with  the  question  of  planning  camera 
positions. 


4  Degenerate  Viewpoints  in  Theory 

If it is to be useful, a representation for the views of a three-
dimensional object or object assembly must give some insight
into  those  viewing  positions  which  are  less  helpful  in  resolving 
ambiguities  of  object  structure,  position,  or  orientation.  We 
present  two  common  views  of  what  such  degeneracy  is,  show 
that  they  are  deficient,  and  redefine  them  in  ways  that  are 
more  quantifiable. 

For  now,  we  will  approach  the  problem  of  image 
degeneracy  purely  theoretically,  under  the  assumption  of  a 
camera  with  infinite  resolution  under  orthographic  projection. 
Later  we  will  make  resolution  finite;  and  although  we  will  not 
address  the  issue  directly,  we  will  occasionally  allude  to  the 
problems  of  perspective  projection. 


4.1  Slight  Movements  Giving  Drastic  Changes? 

Perhaps  the  simplest  example  of  ambiguity  is  the  case  of  a 
head-on  view  of  a  cube,  which  is  ideally  imaged  as  a  square. 
Such  an  image  has  often  been  noted  as  giving  no  information 
as to the three-dimensionality of the object, and has therefore
been described as a degenerate image (see for example,
[Kanade 80; Sabbah 82]). Degeneracy in this context,
however, refers to the fact that a "slight" change in the
viewpoint which generated the image would cause a "drastic"
change in the image (sometimes called an "image event").

This  definition  (and  related  descriptions  of  what  makes  a 
viewing  position  general  or  an  image  characteristic 
[Chakravarty  82])  is  inadequate  in  two  ways: 

1. It is vague with respect to the meanings of
"slight" and "drastic".

2.  It  does  not  encompass  all  the  phenomena  that 
would  seem  to  be  properly  described  as 
examples  of  degeneracy. 

What changes drastically in the cube-as-square image can
be characterized in many ways: the number of regions
changes, the topology of the image regions is altered, apparent
symmetries are modified, and (if lines are labeled in the
Huffman-Clowes manner) the junctions are relabelled (cf.
[Lavin 74; Thorpe and Shafer 83]). Still other derived
properties of the image change, too. More generally,
depending on the means of analysis, this view of the cube
would be called degenerate if a slight change in viewpoint
would alter the "quality" of the ensemble of extracted image
features used in shape analysis. A rigorous definition of this
"quality" change must then include the requirement that a
qualitative change is one that ultimately affects the derivation
of those "semantic" properties of the object (such as identity, scale,
rotation, coloring, etc.) considered important by the system.

The drastic change is therefore a drastic change in
interpretation, not in image. Therefore the perception of a
drastic  change  can  vary  from  system  to  system.  For  example, 
if  the  system  distinguishes  cubes  from  spheres  by  detecting 
the  presence  or  absence  of  long  straight  lines,  the  cube-as- 
square  cannot  be  considered  degenerate.  (And,  in  fact,  if  the 
cube  is  the  only  model  in  the  system  at  all,  no  view  is  ever 
degenerate.) 

What  is  hiding  behind  this  implicit  definition  of  "drastic"  is 
the  interpretation  equivalence  relation;  a  drastic  change  is  a 
change  of  interpretation  equivalence  classes.  However,  since 
in  many  systems  the  interpretation  classes  are  inherited  from 
the  feature  equivalence  classes,  commonly  the  drastic  change 
has  been  attributed  to  image  characteristics  alone. 

Similarly, the notion of "slight movement" is imprecise.
What is usually implicit in such a definition is that there is at least
one direction in which an arbitrarily small movement of the
camera causes the drastic change. (Usually the direction is on
a line perpendicular to an image edge.) The meaning of
"arbitrarily small" only appears to make sense when taken in
the sense of mathematical analysis. That is, the drastic
change must occur for positive movements of magnitude less
than some epsilon, in the direction of degeneracy.

Such a definition would imply that degeneracy can be
qualified as a matter of varying degree, although this is
apparently never stated. That is, what can vary from
degeneracy to degeneracy is the number of possible
("qualitative") drastic changes, and the relative number of
directions in which such resolutions (or non-resolutions) occur.
For example, the cube-as-square image can resolve itself into
an image with either two or three regions, with the two-region
image possible only for four discrete directions of camera
movement. All other directions resolve it into an image with
three regions, even if some of those regions are vanishingly
thin. Further, some "degenerate" views sometimes do not
resolve at all. For example, the cube imaged as two
rectangles remains two rectangles for two discrete directions
of movement, resolving itself into three regions under
movement in all other directions. The space of allowable
degeneracies is apparently very large, and perhaps can be
quantified in absolute terms as some measure defined on the
ways in which the view fails to resolve into something more
"characteristic".

The converse of the common "small gives drastic" definition
is perhaps easier to implement. This converse definition is
stated as follows: An image is seen from a "general
viewpoint" if there is some positive epsilon for which camera
movements of magnitude less than epsilon, in any direction,
can be taken without effect on the resulting semantic analysis.
Here, too, the definition is based
ultimately on system performance; an image can be
degenerate to one system but not another. Note that this
definition of degeneracy need not directly appeal to any
consideration of three-dimensional models.
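
The stability definition and its negation can be written side by side. The notation here is an assumption introduced for illustration and is not the authors': $S^2$ is the viewing sphere, $d$ is angular distance on it, and $\Phi(v)$ is the interpretation class that a particular system assigns to the image seen from viewpoint $v$.

```latex
% v is a general viewpoint for the system \Phi:
\exists\, \epsilon > 0 \;\; \forall u \in S^2 : \quad
  d(u, v) < \epsilon \;\Rightarrow\; \Phi(u) = \Phi(v)

% v is degenerate for the system \Phi (the negation):
\forall\, \epsilon > 0 \;\; \exists u \in S^2 : \quad
  d(u, v) < \epsilon \;\wedge\; \Phi(u) \neq \Phi(v)
```

Because $\Phi$ belongs to the system rather than to the object, the same viewpoint can satisfy the first condition for one system and the second for another.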


4.2  Unlikely  Views? 

However, even this definition of a degenerate viewpoint is
incomplete; basically it says that generality is a form of
stability. But many viewpoints are stable yet ought to be
considered degenerate, at least in the sense that they are
less likely to allow a system to instantiate a proper model than
other viewpoints do.

Consider  a  pyramid  with  a  square  base  and  arbitrary  height. 
(Customarily,  it  has  equilateral  triangles  for  its  sides,  but  we 
relax  that  restriction.)  Imaged  from  many  viewpoints  from 
below,  its  image  appears  to  be  a  type  of  rhomboid:  a  tilting 
square.  None  of  these  viewpoints  is  degenerate  according  to 
the  stability  definition  above,  since  a  slight  change  in  viewpoint 
does  not  cause  a  drastic  change  in  the  image;  it  merely  tilts 
the  rhomboid.  In  fact,  there  is  a  great  deal  of  viewpoint 
freedom,  and  many  views  appear  to  yield  the  same  semantic 
result:  a  partly  instantiated  pyramid  with  height  information 
largely  missing.  What  is  most  disturbing  about  such  views  is 
that  they  are  potentially  the  most  common.  For  a  very  flat 
pyramid,  such  rhomboids  appear  from  nearly  half  of  all  viewing 
directions. 
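
The "nearly half" figure can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not part of the original analysis: orthographic projection, a square base of side `base`, and an apex centered above the base at height `height` (the function and parameter names are hypothetical). Under these assumptions only the base is visible when the viewing direction lies inside a square cone about the downward axis whose half-width tangent is base / (2 * height), and the standard solid-angle formula for such a cone gives the fraction of the viewing sphere.

```python
import math


def base_only_fraction(base: float, height: float) -> float:
    """Fraction of all viewing directions from which only the base is seen.

    A side face is hidden when the viewing direction makes a non-positive
    dot product with its outward normal; for all four side faces this
    confines the direction to a square cone with tangent a = base / (2 * height).
    The solid angle of that cone is 4 * asin(a^2 / (1 + a^2)).
    """
    a = base / (2.0 * height)
    solid_angle = 4.0 * math.asin(a * a / (1.0 + a * a))
    return solid_angle / (4.0 * math.pi)
```

For `height = base / 2` the cone is the one spanned by a cube face seen from the cube's center, and the fraction is 1/6. As the height shrinks toward zero the fraction approaches, but never reaches, 1/2.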

Yet  it  seems  plausible  to  suggest  that  these  particular  views 
be  considered  at  least  partially  degenerate;  in  contrast  to 
some  other  views,  these  images  give  little  information  about 
how  to  instantiate  the  pyramid's  height.  (They  do  place  weak 
upper  limits  on  the  height:  the  peak  is  constrained  to  heights 
that  keep  it  invisible.)  Further,  if  our  model  base  were  more 
complete,  we  would  not  be  able  to  distinguish  such  a  view 
from  among  similar  views  of  a  triangular  wedge  with  square 
base,  or  any  similarly  tapering  polyhedron  with  a  square  base, 
despite  the  stability  of  our  vantage  point.  Thus,  we  may  wish  to 
include  in  our  definition  of  degeneracy  those  viewpoints  from 
which "relatively little" three-dimensional information may be
obtained,  regardless  of  stability. 

Operationally, this aspect of degeneracy can be quantified
as the expected number of additional views necessary to
disambiguate the object; a degenerate view is therefore one
that is relatively uninformative and will require more images.
This number clearly depends on the complexity of the model
data base, the intelligence of the system procedures for
determining the "best" next view, and the (necessarily)
heuristic procedures of the system for inverting the image
projection. Again, this is a system performance definition, and
an image's degree of degeneracy would change as system
parameters change.

It  would  appear  that  this  definition  must  be  probabilistic.  It  is 
not  hard  to  conceive  of  objects  or  object  assemblies  in  which 
multiple  viewpoints  give  identical  images,  but  for  which  the 
resolution  into  a  single  interpretation  takes  varying  strategic 
paths  depending  on  the  differing  image  features  that  can 
appear  in  the  second  view.  (Take  as  an  example  a  cube  with 
a  single  distinguished  face,  viewed  initially  so  that  only  one 
non-distinguished  face  is  visible.)  The  likelihood  of  taking 
each path can be quantified: the most inclusive measure of an
image's degeneracy would then be its probability-labeled
search tree. Various measures based on the full tree (one of
which is, of course, expected depth) could also serve as a
measure of degeneracy. Under this definition, tree breadth
has no strong role: a degenerate image can resolve itself into
hundreds of images, but as long as each new image is
interpretable, the original image is no less degenerate than the
pyramid viewed from below.
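
As a rough sketch of the expected-depth measure, a probability-labeled search tree can be modeled as nodes whose outgoing branches carry the probability of obtaining each possible next view. The class name, field names, and probabilities below are hypothetical and serve only to illustrate the computation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ViewNode:
    """A view in the search tree; a node with no children is fully interpreted."""
    children: List[Tuple[float, "ViewNode"]] = field(default_factory=list)


def expected_additional_views(node: ViewNode) -> float:
    """Expected number of further images needed to reach an interpreted view."""
    return sum(p * (1.0 + expected_additional_views(child))
               for p, child in node.children)


# Hypothetical probabilities: the next view resolves the object half the time,
# and otherwise exactly one more view is enough.
resolved = ViewNode()
tree = ViewNode([(0.5, resolved), (0.5, ViewNode([(1.0, resolved)]))])
```

Only the branch probabilities and path lengths enter the sum, so adding more equally short branches leaves the measure unchanged, which matches the remark that tree breadth has no strong role.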

It is interesting to note a paradoxical consequence of this
systems view of degeneracy. As a system's power increases
due to the availability of more sophisticated shape-analyzing
tools (such as when shape from skewed symmetry is used
with shape from shading), more types of ambiguity are
possible. Each method brings with it a weakness. The
implication is that vision systems with multiple sources of
knowledge must know when to ignore a source undergoing
degeneracy. This meta-knowledge can be explicitly coded, or
implicitly handled by means of a sufficiently flexible
representation that permits "don't know much" as a valid
answer.


5  Specific  Imaging  Degeneracies 

Considering  now  only  images  of  polyhedral  objects,  it  is 
possible  to  give  a  catalogue  of  image  degeneracies.  Each  is 
based  on  a  specific  heuristic  for  inverting  the  three-dimension 
to  two-dimension  image  function.  The  list  is  partial,  and  omits 
some  heuristics  that  are  even  more  fundamental,  such  as  the 
generally assumed heuristic rule that lines in the image have
been  caused  by  lines  in  three-space.  (In  this  last  case,  this 
would  imply  that  any  planar  curve  imaged  from  within  its  plane 
would  often  be  considered  as  degenerate,  if  the  system  were 
unable  to  interpret  it  as  other  than  a  linear  object.) 


5.1  Vertices  Imaged  in  the  Plane  of  Scene  Edges 

Apparently  the  major  source  of  "degenerate  views"  in  the 
blocks  world  (see  Figure  1),  so-called  coincidental  alignments 
occur  when  the  image  of  a  vertex  appears  to  fall  on  the  image 
of  an  edge.  They  confound  the  basic  imaging  assumption  that 
three  or  more  lines  coincident  in  the  image  are  coincident  in 
the  scene.  If  the  scene  is  analyzed  using  labelling,  the 
labelling  will  fail.  The  image  is  then  degenerate  because 
another  image  is  required.  In  theory,  such  coincidental 
alignments  have  probability  zero,  since  the  camera  must  lie  on 
a  specific  plane  (or  more  precisely,  in  the  infinite  intersection 
of  two  co-planar  half-planes). 


² See Figure 5.


5.2  Parallel  Scene  Lines  Imaged  in  Their  Own  Plane 

This is one of the degeneracies observed with the cube². It
violates the heuristic that collinear in the image implies collinear
in the scene, a heuristic often not used. Hence, it is system-
sensitive. In theory, it also has probability zero, although the
camera placement is somewhat more free than in the case of
vertex-on-edge.

5.3  Coincident  Scene  Lines  Imaged  in  Their  Own 
Plane 

This  is  a  special  case  violation  of  linear  in  the  image  implies 
linear  in  space.  Again,  this  is  system  dependent  and  has 
probability  of  zero. 

Figure  6  illustrates  one  example  of  this  degeneracy  while 
viewing  a  pyramid. 


5.4  Perfect  Symmetry 

This  is  an  interesting  extreme  case,  and  one  apparently 
avoided  by  professional  photographers  as  it  appears  to  flatten 
relief³. It is apparently based on the heuristic that symmetry in
the  image  implies  symmetry  in  the  scene  perpendicular  to  the 
line  of  sight.  Analogous  to  what  happens  when  a  cube  is 
imaged  as  two  congruent  rectangles,  it  is  often  degenerate 
since  perfectly  symmetric  images  lack  the  cues  to  depth  that 
broken  (skewed)  symmetries  provide.  It  has  a  probability  of 
zero  of  occurring  ideally,  although  the  camera  now  has  the 
freedom  to  move  in  at  least  one  entire  plane. 


6  The  Effect  of  Finite  Resolution  on  Specific 
Imaging  Degeneracies 

The  various  viewing  points  for  our  camera  may  be  modeled 
as  points  on  a  viewing  sphere  at  whose  center  lies  the  object 
of  interest.  Therefore,  in  the  ideal  case  of  the  cube  with  infinite 
resolution  under  orthography  (or,  for  that  matter,  under 
perspective)  there  are  precisely  six  viewing  directions  from 
which  we  see  exactly  one  face  of  the  cube  and  no  more. 
Similarly,  a  family  of  three  mutually  orthogonal  great  circles 
which  intersect  at  these  six  points  determine  the  set  of 
directions  from  which  we  would  see  exactly  two  faces  of  the 
cube.  Anywhere  else  on  the  sphere  we  see  three  faces  (see 
Figure  2).  A  point  on  the  sphere  chosen  at  random  will  be  a 
viewpoint imaging three faces with probability 1.

Of  course,  any  real  system  will  have  only  finite  resolution. 
How  this  resolution  is  measured,  and  how  repeatable  it  is,  can 
vary  depending  on  application.  For  the  cube,  resolution 
appears  to  be  the  ability  to  separate  two  nearly  concurrent 
parallel  lines  into  their  separate  sources.  (Under  perspective, 
the  parallel  lines  would  only  be  nearly  parallel.)  Assuming  this 
resolvability  of  parallels  is  independent  of  the  line  segment 
lengths (admittedly, this is somewhat unrealistic), then the
zero  probabilities  of  degeneracy  become  finite.  On  the 
viewing  sphere,  the  great  circles  have  become  bands,  and  the 
points  of  single  face  viewing  have  become  spherical  squares. 
(See  Figure  3).  Their  relative  areas  (and  hence  the  probability 
of  degeneracy)  are  straightforward  to  compute  in  terms  of 
camera  acuity.  The  less  accurate  the  camera,  or  the  farther  it 
is  away,  or  the  smaller  the  object,  the  larger  the  likelihood  that 
a  viewpoint  is  degenerate.  In  the  extreme,  the  bands  merge, 
and  no  viewpoint  sees  a  "characteristic"  view:  the  images  are 
infinitely  degenerate. 
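The band picture can be explored numerically. The following is a minimal sketch, not the authors' procedure: it assumes orthographic viewing from a unit viewing sphere and treats a face pair as resolvable only when the viewing direction lies outside that pair's band of angular half-width theta. The function names and the thresholding rule are illustrative assumptions.

```python
import math
import random


def random_direction(rng):
    """Draw a uniformly distributed point on the unit viewing sphere."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return [c / norm for c in v]


def visible_face_count(direction, sin_theta):
    """Count the cube faces treated as resolvable from this direction.

    Assumption: the face pair normal to axis k is resolved only if
    |direction[k]| > sin_theta, i.e. the viewpoint is outside band k.
    """
    return sum(1 for c in direction if abs(c) > sin_theta)


def view_class_fractions(theta, samples=100_000, seed=0):
    """Monte Carlo estimate of the fraction of 0-, 1-, 2-, 3-face views."""
    rng = random.Random(seed)
    s = math.sin(theta)
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for _ in range(samples):
        counts[visible_face_count(random_direction(rng), s)] += 1
    return {k: n / samples for k, n in counts.items()}


def band_fraction(theta, samples=100_000, seed=1):
    """Estimate the fraction of the sphere inside one equatorial band."""
    rng = random.Random(seed)
    s = math.sin(theta)
    hits = sum(1 for _ in range(samples)
               if abs(random_direction(rng)[2]) <= s)
    return hits / samples
```

With theta = 0 every sampled viewpoint sees three faces; as theta grows, the one- and two-face fractions grow, and a single band covers a fraction of about sin(theta) of the sphere.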


3 We provide a straightforward example of perfect symmetry in Figure 7.




The edges of the bands, however, cannot be sharp.
Although the bands partition the surface of the viewing sphere,
their borders represent those viewing directions at which
parallel lines are "first" seen as two lines. Given camera
inaccuracy  and  noise,  this  transition  to  resolvability  cannot  be 
sudden.  Depending  on  the  camera  and  the  accuracy  of  the 
algorithms  processing  its  data,  repeatability  may  best  be 
represented  by  a  fuzzy  boundary.  Thus  instead  of  each  point 
on  the  sphere  having  a  label,  it  has  a  vector  of  (label, 
likelihood)  pairs.  Each  degeneracy  region,  then,  fades  away  in 
likelihood  as  it  extends  farther  from  its  ideal  point  or  great 
circle.  The  actual  computation  of  the  shape  of  this  probability 
density  depends  on  the  image  heuristic  used  for  image 
function  inversion,  as  well  as  some  measure  of  camera  and 
software  precision.  Nevertheless,  the  relative  size  of  the 
integral  of  probability  over  the  partition  can  give  an  accurate 
estimate  of  the  likelihood  that  a  particular  view,  degenerate  or 
not,  will  be  visible  after  a  random  camera  placement. 

Thus: 


6.1  Vertices  Imaged  in  the  Plane  of  Scene  Edges 

The  system  must  separate  vertices  from  lines.  Assuming  a 
basic  radius  of  confusion  around  the  vertex,  under  orthography 
the  degeneracy  region  on  the  sphere  is  a  band  segment.  The 
band  width  is  inversely  proportional  to  accuracy,  but  the  band 
position  on  the  viewing  sphere  and  its  angular  extent  are 
determined  by  the  geometry  of  the  scene.  Basically,  the  band 
segment  lies  in  the  infinite  intersection  of  two  co-planar  half¬ 
planes  determined  by  the  vertex  in  the  scene  and  the  line  in 
the scene. Thus, the viewing sphere of a polyhedral assembly
of objects is criss-crossed with such bands; there are
approximately  as  many  as  there  are  vertices  times  the  number 
of  edges.  (Not  quite  true;  both  the  vertex  and  the  edge  must 
be  visible.)  Degenerate  views  are  therefore  rather  common  in 
practice,  and  become  more  so  as  the  camera  recedes. 


6.2  Parallel  Scene  Lines  Imaged  in  Their  Own  Plane 

Again,  this  is  the  case  of  the  finite  resolution  cube, 
assuming  that  parallelism  detection  is  independent  of  line 
length.  The  degeneracy  region  is  a  band.  This  is  likely  to  be  a 
common  source  of  degeneracy  if  the  imaging  environment 
responds  strongly  to  gravitational  influences:  there  will  be 
many  parallel  vertical  lines. 


6.3  Coincident  Scene  Lines  Imaged  in  Their  Own 
Plane 

This  source  of  degeneracy  depends  on  the  accuracy  of  an 
angle  detector.  The  region  of  degeneracy  is  again  band-like, 
with  width  depending  on  the  detector,  and  position  depending 
on  the  scene;  the  center  of  the  band  lies  in  the  same  plane  as 
the  lines.  The  exact  configuration  is  heavily  dependent  on 
actual  system  performance.  It  may  be  a  common  source  of 
degeneracy  in  the  blocks  world. 


6.4  Perfect  Symmetry 

Again,  this  is  heavily  dependent  on  detector  performance, 
but  it  probably  appears  as  a  fairly  wide  (but  less  probable) 
band centered over the axis of symmetry. The relatively large
width  comes  from  the  fact  that  symmetry  is  a  global  property 
and  its  axis  is  therefore  hard  to  localize. 


6.5  Comments 

In  a  sense  these  aspect  maps  are  property  spheres,  where 
the  property  is  a  type  of  degeneracy.  Computing  them 
demands substantial computational time and storage [Besl
and Jain 85]; both would benefit from a hierarchic, trixel-like
approach. Note that since the sphere is topologically
equivalent to the extended plane, all such maps can be drawn
as planar graphs [Werman, Baugher, and Gualtieri 86]. (See
Figure 4, where the "standard" aspect graph has been
augmented to show all possible transitions out of degenerate
views, as alluded to in [Castore 84]4).

The  finite  resolution  aspect  maps  of  several  polyhedral 
solids appear to be isomorphic to semiregular polyhedra of
the class of "Stott" figures, described in [Stott 10]. They have
the  pleasant  property  that  each  has  a  Hamiltonian  circuit 
[Miller  71],  placing  a  firm  lower  bound  on  the  complexity  of 
disambiguating  an  object  given  its  image  and  some  finite 
resolution  image  feature. 


7  Relative  Probabilities  of  Viewpoint  Classes  for 
the  Cube 

We  might,  in  a  given  situation,  wish  to  know  the  probability 
of  reaching  a  particular  class  of  viewpoint,  given  a  random 
decision  as  to  “where  to  go  next.”  For  the  case  of  the  cube, 
we can obtain the probabilities of 1-, 2-, and 3-faced views by
using  spherical  geometry  to  calculate  the  relative  surface 
areas  on  the  viewing  sphere  of  the  3  regions  described  above 
(square,  rectangular,  and  triangular  patches). 

An  analysis  of  the  cube  shows  that  these  probabilities  are 
generally  a  function  of  system  resolution  only,  independent  of 
the  distance  from  the  object.  Systems  capable  of  higher 
resolution  will  generally  be  less  likely  to  yield 
“uncharacteristic”  views,  as  one  would  expect.  The  details  of 
our  analysis  are  presented  in  Appendix  I. 


8  Minimal  Disambiguation  Distance 

Transitions  between  the  various  regions  of  the  sphere 
represent what [Koenderink and van Doorn 79] term a "visual
event". Only such a transition is capable of yielding
qualitatively  new  information.  Thus  these  transitions  clearly 
represent  a  useful  change  of  viewpoint,  which  would  be  worth 
paying  for  in  terms  of  traveling  distance. 

For  the  purposes  of  minimizing  the  distance  traveled  in 
obtaining  further  images,  it  will  be  useful  to  quantify  the 
minimal  distance  we  must  travel  on  our  viewing  sphere,  to 
ensure  that  we  will  experience  a  visual  event.  If  we  reach  a 
characteristic  view  from  which  an  image  has  not  yet  been 
obtained,  then  we  will  maximize  the  probability  that  any 


4 In particular, we have added, without loss of planarity, the direct
transitions between regions where 1 face is visible, and regions where 3
faces are visible. We noticed that in their excellent survey on 3D object
recognition, [Besl and Jain 85, p. 89] did not show such transitions in their
aspect graph of a cube, nor did their analogs appear in the aspect graph of
a tetrahedron, presented by Koenderink and van Doorn. For an interesting
discussion of the relevance of such transitions to robotic vision, see the
remarks of N. Badler in [Castore 84].


degeneracy will be disambiguated by the new information. In
the case of our viewing sphere of a cube, this would mean
ensuring that a 3-faced image is obtained.


[Ballard and Brown 82]

Ballard, D. H., and Brown, C. M.

Computer Vision.

Prentice-Hall, Englewood Cliffs, NJ, 1982.


Using the assumptions of orthographic projection, we can
easily quantify this minimum distance. For a fixed
viewing-sphere radius of resolution n (i.e. n is an absolute number
equal to the number of pixels our object will occupy in the
image), the distance we must travel along the viewing sphere
is equal to the product of the length of the radius and the
arcsin of (1/n) radians5.
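As a worked illustration of this rule, the sketch below evaluates the product of the radius and arcsin(1/n). The function names are hypothetical, and the only assumption beyond the text is that n is at least 1.

```python
import math


def min_disambiguation_angle(n):
    """Angle (radians) whose sine is 1/n, for a resolution of n pixels."""
    if n < 1:
        raise ValueError("n must be at least 1 pixel")
    return math.asin(1.0 / n)


def min_disambiguation_distance(radius, n):
    """Arc length along a viewing sphere of the given radius."""
    return radius * min_disambiguation_angle(n)
```

For large n the angle is close to 1/n radians, so doubling the resolution roughly halves the distance the camera must travel.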


9  Closing Observations and Possible Future Research

Calculating  the  number  of  necessary  views  and  the  effort  to 
obtain  them  is  a  formidable  task.  In  some  senses  it  resembles 
the  design  of  part  feeders  [Natarajan  86]:  that  is,  given  an 
unknown  position  on  the  viewing  sphere,  determine  what 
series  of  camera  movements  would  inevitably  lead  to  a 
distinguished  configuration,  namely,  the  acquisition  of  all 
relevant  semantic  information  about  an  object  or  object 
assembly.  Even  assuming  one  knows  perfectly  where  one  is 
on  the  viewing  sphere,  the  determination  of  even  the  distance 
to  the  nearest  visual  event  is  complex,  given  its  probabilistic 
nature.  Circumstances  are  easy  to  construct  (for  example, 
when  the  object  is  too  small)  where  it  is  actually  impossible. 

The  aspect  map  can  be  augmented  with  other  information. 
It  can  incorporate  probabilities  such  as  the  likelihood  of  a  given 
gravity-induced  preferred  orientation,  or  it  can  be  convolved 
with  a  placement  uncertainty  spread  function.  The  spread 
function  can  be  variable,  itself  incorporating  such  information 
as  the  robotic  work  space  or  other  constraints  on  placement 
motion.  A  search  for  the  optimal  next  view  could  then  also 
minimize  camera  placement  error  and  also,  by  related 
methods,  camera  placement  costs. 

Such  algorithms  would  be  particularly  valuable  if  ways  exist 
to  formally  combine  the  aspect  maps  of  individual  objects  to 
create  the  aspect  map  of  an  object  assembly.  Thus,  from  a 
few  primitives  and  a  little  knowledge  of  the  robotic  placer  and 
its  workspace,  a  single  representation  could  direct  active 
sensing.  Whether  or  not  such  a  representation  is  ultimately 
practical,  it  has  nevertheless  been  helpful  in  elucidating  the 
meanings  of  "general  viewing  position"  and  "degenerate 
view". 

References 

[Ansaldi,  De  Floriani,  and  Falcidieno  85] 

Ansaldi,  S.,  De  Floriani,  L.,  and  Falcidieno, 
B. 

Geometric  Modeling  of  Solid  Objects  using 
a  Face  Adjacency  Graph 
Representation. 

In SIGGRAPH 85 Conference
Proceedings, pages 131-139. San
Francisco, CA, July 22-26, 1985.


5 In Appendix II, we present the derivation of the minimum angle
guaranteed to produce a visual event. The distance we must travel along
the viewing sphere is obtained by multiplying this angle by the radius of the
viewing sphere.


[Besl and Jain 85]  Besl, P. J., and Jain, R. C.

Three-Dimensional  Object  Recognition. 

ACM  Computing  Surveys,  17(1):75-145, 
March,  1985. 

[Binford  82]  Binford,  T.  O. 

Survey  of  Model-Based  Image  Analysis 
Systems. 

International  Journal  of  Robotics  Research 
1(1):18-64,  1982. 

(Spring  issue). 

[Brooks  81]  Brooks,  R.  A. 

Symbolic  Reasoning  among  3-Dimensional 
Models  and  2-Dimensional  Images. 

Artificial  Intelligence,  17:285-348,  August, 
1981. 

[Callahan  and  Weiss  85] 

Callahan, J., and Weiss, R.

A  Model  for  Describing  Surface  Shape. 

In  Proceedings  of  the  IEEE  Conference  on 
Computer  Vision  and  Pattern 
Recognition,  pages  240-245.  San 
Francisco, CA, June, 1985.

[Canny  84]  Canny,  J.  F. 

Algorithms  for  Model-Driven  Mechanical 
Part  Inspection. 

Research  Report  RC10505,  IBM, 
November,  1984. 

[Castore  84]  Castore,  G.  M. 

Solid  Modeling,  Aspect  Graphs,  and  Robot 
Vision. 

In  Pickett,  M.  S.,  and  Boyse,  J.  W.  (editors), 
Solid  Modeling  by  Computers:  From 
Theory  to  Applications,  pages  277-292. 
Plenum  Press,  New  York  -  London, 
1984. 

[Chakravarty  82]  Chakravarty,  I. 

The  Use  of  Characteristic  Views  as  a  Basis 
for  Recognition  of  Three-Dimensional 
Objects. 

PhD  thesis,  Image  Processing  Laboratory, 
Rensselaer  Polytechnic  Institute,  1982. 

(report  number  IPL-TR-034). 

[Chakravarty  and  Freeman  82] 

Chakravarty,  I.  and  Freeman,  H. 

Characteristic  Views  as  a  Basis  for  3D 
Object  Recognition. 

In  Proceedings  of  the  Society  for  Photo- 
Optical  Instrumentation  Engineers 
Conference  on  Robot  Vision,  Vol.  336 
(Arlington,  VA,  May  6-7),  pages  37-45. 
SPIE,  Bellingham,  WA,  1982. 

[Crawford  85]  Crawford,  C.  G. 

Aspect  Graphs  and  Robot  Vision. 

In  Proceedings  of  the  IEEE  Conference  on 
Computer  Vision  and  Pattern 
Recognition,  pages  382-384.  San 
Francisco, CA, June, 1985.

(presented  as  a  poster  paper). 




[Fekete and Davis 84]

Fekete,  G.  and  Davis,  L. 

Property  Spheres:  A  New  Representation 
for  3-D  Object  Recognition. 

In  Proceedings  of  the  IEEE  Workshop  on 
Computer  Vision:  Representation  and 
Control, pages 192-201. Annapolis,
MD, April 30 - May 2, 1984.

[Grimson  85]  Grimson,  W.  E.  L. 

Sensing  Strategies  for  Disambiguating 
Among  Multiple  Objects  in  Known 
Poses. 

Memo  855,  MIT  Artificial  Intelligence  Lab, 
August,  1985. 

[Ikeuchi  86]  Ikeuchi,  K. 

Generating  an  Interpretation  Tree  From  a 
CAD  Model  to  Represent  Object 
Configurations  for  Bin-Picking  Tasks. 

Technical  Report  CMU-CS-86-144, 
Department  of  Computer  Science, 
Carnegie-Mellon  University,  Aug.  7, 
1986. 

[Kanade  80]  Kanade,  T. 

A  Theory  of  Origami  World. 

Artificial Intelligence, 13(3):279-311, May,
1980. 

[Koenderink and van Doorn 76]

Koenderink, J. J., and van Doorn, A. J.

The Singularities of the Visual Mapping.

Biological Cybernetics, 24(1):51-59, 1976.

[Koenderink and van Doorn 79]

Koenderink, J. J., and van Doorn, A. J.

Internal Representation of Shape with
Respect to Vision.

Biological Cybernetics, 32(4):211-216,
1979.

[Korn  and  Dyer  86] 

Korn,  M.  R.  and  Dyer,  C.  R. 

3-D  Multiview  Object  Representations  for 
Model-Based  Object  Recognition. 

Research  Report  RC11760,  IBM,  March, 
1986. 

[Lavin  74]  Lavin,  M.  A. 

An  Application  of  Line-Labeling  and  other 
Scene-Analysis  Techniques  to  the 
Problem  of  Hidden-Line  Removal. 

Working  Paper  66,  MIT  Artificial  Intelligence 
Lab, March, 1974.

[Miller  71]  Miller,  J.  E. 

Transmission  of  Analog  Signals  over  a 
Gaussian  Channel  by  Permutation 
Modulation  Coding. 

PhD  thesis,  Columbia  University, 

Department  of  Mathematical  Statistics, 
1971. 

[Natarajan  86]  Natarajan,  B. 

Motion  Planning  and  its  Dual. 

PhD  thesis,  Cornell  University,  Department 
of  Computer  Science,  1986. 

forthcoming. 


[Requicha  80]  Requicha,  A.  A.  G. 

Representations  for  Rigid  Solids:  Theory, 
Methods,  and  Systems. 

ACM Computing Surveys, 12(4):437-464,
December,  1980. 

[Sabbah  82]  Sabbah,  D. 

A  Connectionist  Approach  to  Visual 
Recognition. 

PhD  thesis,  Rochester  University, 

Department  of  Computer  Science,  April, 

1982. 

(available  as  TR107). 

[Scott  84]  Scott,  R. 

Graphics  and  Prediction  from  Models. 

In Proceedings of the ARPA Image
Understanding Workshop, pages
98-106. Science Applications Inc.,
McLean, VA, New Orleans, LA, Oct.
3-4, 1984.

[Stott 10]  Stott, A. B.

Geometrical Deduction of Semiregular from
Regular Polytopes and Space Fillings.
Ver. der Koninklijke Akad. van
Wetenschappen te Amsterdam (eerste
sectie), 11(1), 1910.

[Thorpe  and  Shafer  83] 

Thorpe,  C.,  and  Shafer,  S. 

Topological  Correspondence  in  Line 
Drawings  of  Multiple  Views  of  Objects. 
Technical Report CMU-CS-83-113,
Department of Computer Science,
Carnegie-Mellon University, March,
1983.

[Werman,  Baugher,  and  Gualtieri  86] 

Werman,  M.,  Baugher,  S.,  and  Gualtieri, 

J.  A. 

The  Convex  Visual  Potential. 

Internal  Memo,  University  of  Maryland, 
Center  for  Automation  Research, 
February,  1986. 

[Wong and Fu 84]  Wong, E. K., and Fu, K. S.

A  Graph-Theoretic  Approach  to  Model 
Matching  in  Computer  Vision. 

In  Proceedings  of  the  IEEE  Workshop  on 
Computer  Vision:  Representation  and 
Control, pages 106-111. Annapolis,
MD, April 30 - May 2, 1984.

Appendix I

Probabilities for the 3 Viewpoint Types of the Cube

These probabilities will be calculated by finding the relative surface
areas on the viewing sphere for the three kinds of regions (see
Figure 1). Thus, our first goal is to find the surface areas of the three
regions on the viewing sphere.

We start by examining one of the 3 orthogonally intersecting
equatorial bands. To find the surface area of one such band, we use
the surface area integral for revolving the graph of a non-negative
parametric function around the X axis, where x' and y' are
continuous, and y remains positive: for x(t), y(t), $t \in [c, d]$,

$$S = \int_c^d 2\pi\, y(t) \sqrt{[x'(t)]^2 + [y'(t)]^2}\; dt$$

In our case, we generate a sphere of radius r by revolving about the
x-axis the following parametric arc: $x(t) = r\cos t$, $y(t) = r\sin t$, for
$t \in [\frac{\pi}{2}, \frac{\pi}{2} + \theta]$.

Thus, we have the desired surface area, S, as:

$$S = 2\pi \cdot 2 \int_{\pi/2}^{\pi/2+\theta} r \sin t \sqrt{r^2(\sin^2 t + \cos^2 t)}\; dt$$

$$= 2\pi r^2 \int_{\pi/2}^{\pi/2+\theta} 2 \sin t \; dt$$

$$= 4\pi r^2 \left[ -\cos t \right]_{\pi/2}^{\pi/2+\theta}$$

$$S_{Band} = 4\pi r^2 \sin\theta.$$

For the square patches, multiply the above quantity by the ratio
$\frac{2\theta}{2\pi}$, i.e. the ratio of the curved square length to the
circumference ("great circle") of the sphere.

$$S_{Square} = 4 r^2 \theta \sin\theta.$$

For the probability that a random viewpoint will be from one of these
regions, multiply by 6, the number of such regions, and divide by the
total surface area of the sphere, $4\pi r^2$:

$$P_{SquarePatch} = \frac{6\,\theta \sin\theta}{\pi}$$

Now, to get the probability that we are in one of the bands, but not in
a square patch (i.e. that we see exactly 2 faces), multiply $S_{Band}$ by 3
(the number of such orthogonal bands), subtract the 6 square
single-face-view regions1 (each square lies in two bands, so each is
subtracted twice), and divide by the total surface area of the
sphere:

$$P_{RectangularPatch} = \frac{3\,(4\pi r^2 \sin\theta) - 12\,(4 r^2 \theta \sin\theta)}{4\pi r^2} = \frac{3 \sin\theta\,(\pi - 4\theta)}{\pi}$$

Once we have calculated $P_{SquarePatch}$ and $P_{RectangularPatch}$, we can
calculate the probability, $P_{TriangularPatch}$, of being in one of the 8
triangular regions from which 3 faces of the cube will be visible. We
simply subtract from 1 the probability of being in any other region.
That is to say

$$P_{TriangularPatch} = 1 - \left( P_{SquarePatch} + P_{RectangularPatch} \right) = 1 - \frac{3 S_{Band} - 6 S_{Square}}{4\pi r^2}$$

$$P_{TriangularPatch} = 1 - 3 \sin\theta + \frac{6\,\theta \sin\theta}{\pi}$$

1 Since the band and square regions are not disjoint, we must make sure not to
count the square patches twice!

Appendix II

Derivation of $\theta$, the Minimum Disambiguation Angle1

Looking at line RG (passing through point F):

$$X_R = r \cos\alpha, \qquad Y_R = r \sin\alpha.$$

The slope of line RG is $(-d + r\sin\alpha)/(r\cos\alpha)$, where $\alpha$ is defined
such that $\alpha = \frac{\pi}{4} - \theta$, and the y-intercept is d. Hence, the equation of
line RG is:

$$y = \left( \frac{-d + r \sin\alpha}{r \cos\alpha} \right) x + d$$

To find $X_F$, we intersect RG with the line y = s:

$$y = s = \left( \frac{-d + r \sin\alpha}{r \cos\alpha} \right) X_F + d.$$

This yields

$$X_F = \frac{(d - s)\, r \cos\alpha}{d - r \sin\alpha}.$$

However, for the purposes of this derivation, we shall assume that
object size is much smaller than the viewing distance, i.e.
$s \ll d$, and in turn2 $r \ll d$.

So long as this assumption is correct, we may simplify our equation
for $X_F$ as follows:

$$X_F \approx \frac{(d - s)\, r \cos\alpha}{d}.$$

Similarly, looking at line GQ (which continues through point E):

$$X_Q = Y_R = r \sin\alpha, \qquad Y_Q = X_R = r \cos\alpha.$$

We proceed to find the slope of line GQ and its y-intercept, yielding
the equation of line GQE as

$$y = \left( \frac{-d + r \cos\alpha}{r \sin\alpha} \right) x + d.$$

Intersecting line GQE with the line y = s, we find that

$$X_E = \frac{(d - s)\, r \sin\alpha}{d - r \cos\alpha}.$$

1 The smallest angle of movement along the viewing sphere which will guarantee
a visual event while viewing the cube. Note that we assume perspective projection.

2 Since r is defined as $\sqrt{2}\, s$.


Again, under the assumption of object size being much smaller than
viewing distance, this simplifies to

$$X_E \approx \frac{(d - s)\, r \sin\alpha}{d}.$$

Now we wish to find $\theta$, such that the distance taken up in the image
between the 2 cube corners, Q and R, is one pixel. This is, in fact,
the fundamental constraint, based upon which we are attempting to
determine the disambiguation angle for movement along the viewing
sphere which will guarantee the "visual event" of (heretofore hidden)
face BC becoming visible in the new position of face QR.

The precise definition of this constraint is that segment YZ in the
image plane should take up a fraction of RH inversely proportional
to the total number of pixels in the image of the unrotated cube side.
Thus,

$$X_F - X_E = \frac{1}{n}\,(2s).$$

Combining this with previous equations yields

$$\cos\alpha - \sin\alpha = \frac{\sqrt{2}}{n}.$$

However, using the definition of $\alpha = \frac{\pi}{4} - \theta$, we can simplify by
trigonometric substitutions, in terms of $\theta$, eventually arriving at

$$\cos\alpha - \sin\alpha = \sqrt{2}\, \sin\theta.$$

Equating the two expressions gives $\sin\theta = 1/n$, i.e.
$\theta = \arcsin(1/n)$, the angle quoted in Section 8.
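A quick numeric check of the small-object approximation can be sketched as follows. It reads the appendix formulas as X_F = (d - s) r cos(alpha) / (d - r sin(alpha)) and X_E = (d - s) r sin(alpha) / (d - r cos(alpha)), with r = sqrt(2) s and alpha = pi/4 - theta; that reading and the function names are assumptions made for illustration.

```python
import math


def image_gap(theta, s, d):
    """Exact X_F - X_E under the formulas assumed in the lead-in."""
    r = math.sqrt(2.0) * s            # circumradius of a square of side 2s
    alpha = math.pi / 4.0 - theta
    x_f = (d - s) * r * math.cos(alpha) / (d - r * math.sin(alpha))
    x_e = (d - s) * r * math.sin(alpha) / (d - r * math.cos(alpha))
    return x_f - x_e


def one_pixel(s, n):
    """Width of one pixel when the unrotated cube side (2s) spans n pixels."""
    return 2.0 * s / n


if __name__ == "__main__":
    n, s, d = 100, 1.0, 1000.0
    theta = math.asin(1.0 / n)        # angle predicted by sin(theta) = 1/n
    print(image_gap(theta, s, d), one_pixel(s, n))
```

When d is much larger than s, the gap at theta = arcsin(1/n) comes out very close to one pixel, which is the sense in which the approximation holds.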


Figure  2 

Cube,  at  center  of 
the  "viewing  sphere" 


The  authors  wish  to  thank  Jonathan  Gross  and  Eugene  Pinsky  for 
their  helpful  comments  and  suggestions. 


Figure 1

Classical "degenerate" view


Figure 3

Cube at center of viewing sphere;
degeneracies delineated
(region labels: "3 faces", "1 face visible")


THE ROLE AND USE OF COLOR IN A GENERAL VISION SYSTEM


Glenn Healey and Thomas O. Binford


Artificial Intelligence Laboratory
Computer Science Department
Stanford University
Stanford, CA 94305


Abstract


We show that an intelligent approach to color can be
used to significantly improve the capabilities of a vision
system. This paper is divided into three parts. In part
I, we adopt general physical models for the properties
of objects which determine how they interact with light.
These models are far more general than those typically
used in computer vision. From these models we derive
generic properties relating color, geometry, and image
irradiance. In part II, we discuss the recovery and
representation of color. We show how to recover physical
color from images captured using different sensors.
Under general assumptions, we derive expressions for
sensor spectral transmission functions which are provably
optimal for color recovery. A metric space for colors
is introduced and its significant properties are described.
Within this general framework for color, we discuss the
advantages of the specific representation employed by
our system. In part III, we use our physical models to
derive powerful algorithms for extracting invariant
properties of objects from images. The first algorithm is used
for the generic classification of objects according to
material. The second algorithm provides a solution to the
color constancy problem. These algorithms have been
implemented and consistently produce correct results on
real images. Some examples of experimental results are
presented.


1.  Introduction

A general vision system must be capable of generating
meaningful descriptions of a scene in any of the
diverse environments for which human vision is useful.
Nearly all existing artificial vision systems are special
purpose. They rely heavily on domain-specific constraints
and special-case engineering. Although significant
progress has been made in machine vision, the best
artificial vision systems still fall far short of achieving
the capabilities associated with general vision.

Any general vision system must possess robust
generic models of both the physical world and the
image-forming process. For a vision system performing a fixed
task in a simple environment, simple domain-specific
models are often adequate. Unfortunately, such simple
models typically become useless when the system is
presented with a slightly more complex environment.

Given robust physical models, a general vision system
must be able to use these models to extract invariant
properties of objects from images. Invariant properties
are those properties which do not depend on either
the imaging geometry or illumination conditions. The
ability to infer invariant properties of objects allows a
vision system to generate meaningful descriptions of a
scene in any environment. The human vision system, in
particular, is remarkably successful at inferring invariant
properties of objects from images in widely varying
environments.

In this paper, we analyse the role and use of color in
a general vision system. As the first step in the analysis,
we examine the physics of reflection to develop general
models which describe the formation of color images.
From these models, we isolate the invariant properties
of objects with respect to color. Once these invariant
properties are understood, we derive general procedures
which reliably extract these invariant properties from
images.

This paper is organized into three parts. Part I
(sections 2-4) describes the physics underlying the formation
of color images. Part II (sections 5-8) gives methods for
color recovery and representation. Part III (sections 9-12)
presents our color algorithms and experimental results.

I.  PHYSICAL  CAUSES  OF  COLOR 

2.  The Physics of Reflection

There are several ways an object can modify incident
light. An object can change light spatially by
reflecting it into a small or large angle. Incident light
can be modified spectrally. Also, the energy of incident
light can be reduced if a large fraction of the light is
absorbed by an object.

In  this  section,  we  describe  general  models  for  the 
properties of objects which determine how they interact
with  light.  In  part  III,  we  use  these  models  to  develop 
techniques  for  inferring  invariant  properties  of  objects 
from  images. 


2.1.  Fresnel  Reflection 

When  light  is  incident  on  the  surface  of  an  object, 
some  fraction  of  it  is  reflected.  A  smooth  surface  re¬ 
flects  light  only  in  the  direction  such  that  the  angle  of 
incidence  equals  the  angle  of  reflection.  The  properties 
of  this  specularly  reflected  light  are  determined  by  the 
optical  and  geometric  properties  of  the  surface.  For  ma¬ 
terials  which  are  optically  homogeneous,  appearance  is 
completely  determined  by  the  properties  of  specularly 
reflected  light.  Metals  make  up  the  largest  class  of  ho¬ 
mogeneous  materials.  For  many  inhomogeneous  materi¬ 
als,  such  as  plastics,  specular  effects  are  also  significant. 

The optical properties of a surface material are
summarized by the complex index of refraction
$M = n - iK_0$, where n is the refractive component and $K_0$ is
the absorptive component. Both n and $K_0$ are functions
of wavelength. From M, the Fresnel equations
completely describe the light reflected from a surface.
If unpolarized light is incident at an angle $\theta_i$, then the
monochromatic specular reflectance F is

$$F = 0.5\,(R_\perp + R_\parallel) \qquad (2.1)$$

where $R_\perp$ is the perpendicular polarized component and
$R_\parallel$ is the parallel polarized component. From
electromagnetic theory, $R_\perp$ and $R_\parallel$ are given by

$$R_\perp = \frac{a^2 + b^2 - 2a\cos\theta_i + \cos^2\theta_i}{a^2 + b^2 + 2a\cos\theta_i + \cos^2\theta_i} \qquad (2.2)$$

$$R_\parallel = R_\perp\, \frac{a^2 + b^2 - 2a\sin\theta_i\tan\theta_i + \sin^2\theta_i\tan^2\theta_i}{a^2 + b^2 + 2a\sin\theta_i\tan\theta_i + \sin^2\theta_i\tan^2\theta_i} \qquad (2.3)$$

where

$$a^2 = 0.5\left[\sqrt{(n^2 - K_0^2 - \sin^2\theta_i)^2 + 4n^2K_0^2} + (n^2 - K_0^2 - \sin^2\theta_i)\right] \qquad (2.4)$$

$$b^2 = 0.5\left[\sqrt{(n^2 - K_0^2 - \sin^2\theta_i)^2 + 4n^2K_0^2} - (n^2 - K_0^2 - \sin^2\theta_i)\right] \qquad (2.5)$$

Equations (2.2) and (2.3) are known as the Fresnel equations.
The Fresnel equations are derived in many places
including [4].
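The reflectance computation can be sketched in a few lines. This is an illustrative evaluation of the standard form of equations (2.1)-(2.5), taking a and b as the squared quantities a^2 and b^2 given by the bracketed expressions; the function name and argument names are assumptions, and the angle is in radians and strictly below 90 degrees.

```python
import math


def fresnel_reflectance(n, k0, theta_i):
    """Return (R_perp, R_par, F) for unpolarized light at angle theta_i.

    n is the refractive component and k0 the absorptive component of the
    complex index of refraction at a single wavelength.
    """
    sin2 = math.sin(theta_i) ** 2
    x = n * n - k0 * k0 - sin2
    root = math.sqrt(x * x + 4.0 * n * n * k0 * k0)
    a2 = max(0.0, 0.5 * (root + x))   # a^2; clamp guards against rounding
    b2 = max(0.0, 0.5 * (root - x))   # b^2
    a = math.sqrt(a2)
    c = math.cos(theta_i)
    st = math.sin(theta_i) * math.tan(theta_i)
    r_perp = (a2 + b2 - 2.0 * a * c + c * c) / (a2 + b2 + 2.0 * a * c + c * c)
    r_par = r_perp * ((a2 + b2 - 2.0 * a * st + st * st)
                      / (a2 + b2 + 2.0 * a * st + st * st))
    return r_perp, r_par, 0.5 * (r_perp + r_par)
```

At normal incidence the result reduces to ((n - 1)^2 + K_0^2) / ((n + 1)^2 + K_0^2), and for a non-absorbing material the parallel component vanishes at the angle whose tangent is n, which gives two easy sanity checks.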

The Fresnel equations describe reflection from a
smooth surface. In practice, most surfaces are not
smooth. The Torrance-Sparrow specular model [23]
describes specular reflection from rough surfaces. This
model assumes that a surface is composed of small,
randomly oriented, mirror-like facets. Only facets with a
normal oriented in the perfect specular direction
contribute to the monochromatic specular reflectance $R_S$.
The model also quantifies the shadowing and masking of
facets by adjacent facets using a geometrical attenuation
factor. The resulting specular model is

$$R_S = F D A \qquad (2.6)$$

where

F = Fresnel specular reflectance
D = facet orientation distribution function
A = adjusted geometrical attenuation factor


2.2.  Colorant  Layer  Scattering 

For  inhomogeneous  materials,  the  most  prominent 
optical  process  is  the  scattering  of  light  by  colorant  lay¬ 
ers.  Inhomogeneous  materials  include  plastics,  paper, 
textiles,  and  paints.  In  this  section,  we  describe  a  phys¬ 
ical  model  for  the  scattering  of  light  by  inhomogeneous 
materials. 

The  fraction  of  the  incident  light  which  is  not  spec¬ 
ularly  reflected  enters  the  body  of  the  material.  For 
inhomogeneous  materials,  the  body  is  composed  of  a 
vehicle  and  many  embedded  colorant  particles.  While 
in  the  body  of  a  material,  light  interacts  with  many  col¬ 
orant  particles.  When  light  encounters  a  colorant  par¬ 
ticle,  some  portion  of  it  is  reflected.  The  net  result  of 
many reflections is that the light is diffused and a significant
fraction exits through the surface in a wide range
of directions (Figure 1).

By scattering we refer to this process of diffusion
by many reflections. In addition to reflecting light, the
colorant  particles  also  selectively  absorb  certain  wave¬ 
lengths.  This  selective  absorption  is  responsible  for  the 
color  of  the  scattered  light.  In  general,  this  light  will 
be  of  a  different  color  than  the  light  reflected  from  the 
surface. 




(2.7) 


Figure 1. Scattering by Colorant Particles

Kubelka-Munk (K-M) theory [15] is a general mathematical
treatment of scattering and absorption in colorant layers.
The K-M theory assumes that a colorant layer is composed of
a large number of optically identical elementary layers.
The thickness of each elementary layer is small compared to
the thickness of the entire colorant layer, but is large
compared to the diameter of individual colorant particles.
Thus it is not necessary to model the optical properties of
individual colorant particles. The effects of many colorant
particles are modeled by the properties of an elementary
layer. An elementary colorant layer is characterized by the
parameters $\alpha$ and $\sigma$. $\alpha(\lambda)$ is the fraction of light
which is absorbed per unit path length. $\sigma(\lambda)$ is the
fraction of light which has its direction reversed by
scattering per unit path length. Both $\alpha$ and $\sigma$ are
functions of wavelength. The model gives rise to
simultaneous first order differential equations. These
equations can be solved to give expressions for the
reflectance and transmission of a colorant layer.

The original Kubelka-Munk theory makes several limiting
assumptions. The original theory assumes the boundary
condition of diffusely incident light. This is an
unrealistic assumption for most real situations. The
original K-M theory also assumes that the vehicle
containing the colorant particles has an index of
refraction equal to that of air. This assumption eliminates
the need to consider internal and external reflections at
the air-vehicle interface. Unfortunately, this assumption
is also not very realistic.

The original Kubelka-Munk theory has been extended by
Reichman [19] to eliminate the need for these unrealistic
assumptions. Reichman derives an expression for the
reflectance of an inhomogeneous material
which  is  valid  for  collimated  light  at  any  angle  of  in¬ 
cidence.  Reichman  also  uses  a  method  developed  by 
Orchard  [18]  to  take  into  account  both  internal  and  ex¬ 
ternal  reflections  at  the  air-vehicle  interface. 

For our purposes, we consider opaque colorant layers
composed of isotropic scatterers. For light incident at an
angle $\theta_i$, the extended K-M theory describes the body
reflectance $R_B$ as


*•  - (l  _  Rs)-nr-'T^sr 

where $R_S$ is the specular reflectance given by (2.6). $r_i$
is the internal diffuse surface reflectance approximated
by Orchard [18] as

$$r_i = 1 - \frac{0.5601 - 0.7009n + 0.3319n^2 - 0.0636n^3}{n^2} \qquad (2.8)$$

where $n$ is the index of refraction of the vehicle. Let
$\omega = \sigma/(\alpha + \sigma)$ be the scattering albedo. $R_\infty$ is the
reflectance predicted by original K-M for diffusely
incident light and is given by

$$R_\infty = \frac{2 - \omega - 2\sqrt{1 - \omega}}{\omega} \qquad (2.9)$$

$C$ and $D$ result from the solution of Reichman's
differential equations and are

$$C = \frac{\omega\cos\theta_i\,(2\cos\theta_i + 1)}{1 - 4(1 - \omega)\cos^2\theta_i} \qquad (2.10)$$

$$D = \frac{2\cos\theta_i - 1}{2\cos\theta_i + 1} \qquad (2.11)$$

One very special case of this model is conservative
scattering (also called Lambertian scattering) for which
$\omega(\lambda) = 1$ for all visible wavelengths. A Lambertian
scattering model is frequently assumed in computer vision.
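To illustrate how wavelength-selective absorption colors the scattered light, the sketch below evaluates the original K-M reflectance (2.9) from a scattering albedo. It assumes the albedo is the usual ratio $\sigma/(\alpha + \sigma)$; the function names and the sample coefficients are hypothetical and chosen only for illustration.

```python
import math

def albedo(alpha, sigma):
    """Scattering albedo, assumed here to be sigma / (alpha + sigma)."""
    return sigma / (alpha + sigma)

def r_infinity(w):
    """Original K-M reflectance (2.9) for diffusely incident light, 0 < w <= 1."""
    return (2.0 - w - 2.0 * math.sqrt(1.0 - w)) / w

# Conservative scattering: no absorption, so all light is returned.
r_conservative = r_infinity(albedo(0.0, 2.0))

# Equal absorption and scattering at some wavelength: most light is absorbed.
r_absorbing = r_infinity(albedo(1.0, 1.0))
```

Because $\alpha$ and $\sigma$ vary with wavelength, evaluating `r_infinity` wavelength by wavelength gives a reflectance curve whose shape follows the absorption bands of the colorant.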

2.3.  The  Reflectance  Model 

Given our models of both interface and body reflection,
we can quantify the reflectance $R$ as

$$R = R_S + R_B \qquad (2.12)$$

where $R_S$ is the Fresnel reflection term and $R_B$ is the
Kubelka-Munk body reflection term. The power of the light
reflected from a surface at a single wavelength $\lambda_0$
towards a viewer is then given by

$$I(\lambda_0) = R(\lambda_0)L(\lambda_0) \qquad (2.13)$$

where $L(\lambda)$ quantifies the power of the incident light.


3. Image Irradiance and Geometry

Image irradiance is a single measure of how much visible
light strikes an area of an image. It is defined as power
per unit area. Image irradiance, therefore, contains no
direct information about the color of the light striking
the image, i.e., many different spectral distributions of
light will give the same image irradiance. In this section,
we describe the relationship between image irradiance and
geometry for both ideal specular surfaces and ideal diffuse
surfaces. We note that $R_S$ in (2.12) is defined relative to
an ideal specular surface, while $R_B$ in (2.12) is defined
relative to an ideal diffuse surface.

3.1.  Fresnel  Reflection 

We begin by analyzing the properties of image irradiance
for a perfectly specular surface. In this limiting case,
all light incident on the surface will be reflected in
such a way that $\theta_v = \theta_i$ and $L$, $N$, and $V$ lie in the
same plane (Figure 2).


Figure 2. The Reflection Geometry

Thus, all light incident on a surface point from a single
direction will be reflected in a single direction.

Consider a small source of radiance $L_i$ and a surface
patch $dA$. Then the power of the light incident on $dA$ is
given by

$$\Phi = L_i\cos\theta_i\,dA\,d\omega_i \qquad (3.1)$$

where $d\omega_i$ is the solid angle of the source as viewed
from $dA$. If $\phi$ is taken to be the azimuthal angle, with
$\phi = 0$ corresponding to the direction of the projection
of $V$ onto the tangent plane to the surface, then, by
assumption, all of the incident light will be reflected in
the direction $\theta_v = \theta_i$, $\phi = 0$. If our imaging system
is aligned in this direction, light of power $\Phi$ will pass
through the lens. For any other position $V$, no light from
$dA$ will enter the lens and image irradiance resulting
from light reflected from $dA$ will be zero. Assuming no
losses in the lens, image irradiance $E_I$ will be given by

$$E_I = \frac{\Phi}{dI} = \frac{L_i\cos\theta_i\,dA\,d\omega_i}{dI} \qquad (3.2)$$

where $dI$ is the image area corresponding to the surface
area $dA$.

The ratio $dA/dI$ is determined by the imaging geometry
(Figure 3)

Figure 3. The Imaging Geometry

$$\frac{dA}{dI} = \frac{\cos\beta}{\cos\gamma}\left(\frac{z}{f}\right)^2$$

For most real imaging situations, $z/f$ will be large
and we will measure a large image irradiance for the case
of perfect specular reflection.


3.2.  Colorant  Layer  Scattering 

A perfectly diffuse (Lambertian) surface reflects all
incident light in such a way that the perceived brightness
of a surface element is constant with respect to $V$.
Consider a small source of radiance $L_i$ with solid angle
$d\omega_i$ as viewed from a surface patch. The surface
irradiance is then given by

$$E_S = L_i\cos\theta_i\,d\omega_i \qquad (3.5)$$

From our definition of a perfectly diffuse surface, the
bidirectional reflectance $R$ must be a constant. Since this
surface reflects all incident light, the light reflected
into the viewing hemisphere must equal $E_S$. Therefore,
we have


$$\int_{-\pi}^{\pi}\int_0^{\pi/2} R\,L_i\cos\theta_i\,d\omega_i\,\cos\theta_v\sin\theta_v\,d\theta_v\,d\phi = L_i\cos\theta_i\,d\omega_i \qquad (3.6)$$

where $\sin\theta_v\,d\theta_v\,d\phi$ is the solid angle of a patch of size
$d\theta_v$ in zenith angle and $d\phi$ in azimuthal angle. The
$\cos\theta_v$ occurs in the integrand since we are integrating
over projected solid angle.

Equation (3.6) can be solved to give

$$R = \frac{1}{\pi} \qquad (3.7)$$
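The step from (3.6) to (3.7) can be checked numerically. The sketch below is an illustration rather than part of the original derivation: it integrates $\cos\theta_v\sin\theta_v$ over the viewing hemisphere with a midpoint rule and inverts the result. The function name and step count are arbitrary choices.

```python
import math

def projected_solid_angle(steps=2000):
    """Midpoint-rule value of the integral of cos(t) sin(t) dt dphi over the hemisphere."""
    dt = (math.pi / 2.0) / steps
    inner = sum(math.cos((k + 0.5) * dt) * math.sin((k + 0.5) * dt)
                for k in range(steps)) * dt
    return 2.0 * math.pi * inner      # the integrand does not depend on phi

# Dividing both sides of (3.6) by L_i cos(theta_i) dw_i leaves R times this integral = 1.
R = 1.0 / projected_solid_angle()
```

The integral evaluates to $\pi$, so the constant bidirectional reflectance of a surface that reflects all incident light is $1/\pi$.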

The radiance from the surface is

$$L_S = \frac{1}{\pi}L_i\cos\theta_i\,d\omega_i \qquad (3.8)$$

The relationship between surface radiance and image
irradiance is derived in many places including [12]
and is given by

$$E_I = L_S\,\frac{\pi}{4}\left(\frac{d}{f}\right)^2\cos^4\beta \qquad (3.9)$$

where $d$ is the diameter of the lens. For a Lambertian
surface we have

$$E_I = \frac{L_i}{\pi}\cos\theta_i\,d\omega_i\,\frac{\pi}{4}\left(\frac{d}{f}\right)^2\cos^4\beta \qquad (3.10)$$

In the usual case for which the image covers a narrow
angle, $\beta$ will be nearly zero and $\cos^4\beta$ will be nearly
unity.

Here  we  see  an  important  result  of  the  geometric 
difference between specular reflection and diffuse reflection.
The result is that for almost all imaging situations,
the image irradiance corresponding to specular
reflection  is  much  larger  than  the  image  irradiance  cor¬ 
responding  to  diffuse  reflection.  Qualitatively,  the  rea¬ 
son  for  this  difference  is  that  the  specularly  reflected 
light  is  concentrated  in  a  single  direction,  while  the  dif¬ 
fusely  reflected  light  is  spread  out  over  the  viewing  hemi¬ 
sphere.  We  will  return  to  this  result  in  section  9  when 
we  describe  a  method  which  can  be  used  to  distinguish 
specularly  reflected  light  from  diffusely  reflected  light. 
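A rough numerical comparison shows the size of this effect. The sketch below assumes the standard imaging relations: an area ratio $dA/dI = (\cos\beta/\cos\gamma)(z/f)^2$ for the specular case, and $E_I = (L_i/4)\cos\theta_i\,d\omega_i\,(d/f)^2\cos^4\beta$ for the diffuse case. Here $z$ is taken to be the distance from the lens to the surface patch, which is an assumption of this example, as are the function name and the sample values.

```python
import math

def specular_to_diffuse_ratio(z, d, beta=0.0, gamma=0.0):
    """Ratio of image irradiance for an ideal specular vs. an ideal diffuse surface.

    The source terms L_i cos(theta_i) dw_i and the focal length f cancel,
    leaving only the patch distance z, the lens diameter d (same units),
    and the angles beta and gamma of the imaging geometry (radians).
    """
    specular = (math.cos(beta) / math.cos(gamma)) * z ** 2
    diffuse = 0.25 * d ** 2 * math.cos(beta) ** 4
    return specular / diffuse

# A patch 1 m from a lens of 1 cm diameter, viewed on axis.
ratio = specular_to_diffuse_ratio(z=1.0, d=0.01)
```

For these illustrative values the specular image irradiance exceeds the diffuse one by a factor of about $4 \times 10^4$, which is why specular features stand out so strongly.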

4.  Color 

We define the color of light reflected from a surface
to be the light's spectral power distribution $I(\lambda)$ in the
visible range. Thus

$$I(\lambda) = L(\lambda)R(\lambda) \qquad (4.1)$$

where $L(\lambda)$ is the spectral power distribution of the
light incident on the surface and $R(\lambda)$ is the spectral
reflectance of the surface. In this section, we describe
generic properties of the function $R(\lambda)$ with respect to
both geometry and the properties of the reflecting
material.


4.1. Fresnel Reflection

4.1.1.  Homogeneous  Materials 

Optically  homogeneous  materials  have  a  constant 
index  of  refraction  throughout  the  material.  Light  which 
is  incident  on  a  homogeneous  material  is  either  specu¬ 
larly  reflected  at  the  surface  or  absorbed  by  the  material. 
Since light is not scattered in the body of the material,
no  light  which  enters  the  material  returns  through  the 
surface.  Metals  provide  the  most  abundant  examples  of 
homogeneous  materials. 

The spectral reflectance of a homogeneous material is
determined entirely by the Fresnel component of
reflectance $R_S$. For fixed geometry, the variation of
$R_S$ with $\lambda$ depends on the complex index of refraction
$M = n(\lambda) - iK_0(\lambda)$. For some homogeneous materials,
M  can  vary  considerably  with  wavelength.  Copper,  for 
example,  reflects  long  visible  wavelengths  much  more 
efficiently  than  short  visible  wavelengths.  Conversely, 
the  reflectance  of  aluminum  is  approximately  constant 
across  the  visible  spectrum. 

4.1.2. Inhomogeneous Materials

The body of an inhomogeneous material is made up of
colorant particles embedded in a vehicle. While most
homogeneous materials are characterized by a large
extinction coefficient $K_0$, inhomogeneous materials have a
negligible extinction coefficient across the visible
spectrum ($K_0(\lambda) \approx 0$). For inhomogeneous materials $n(\lambda)$

depends on wavelength, but this dependence is typically
small. For most inhomogeneous materials, $n(\lambda)$ is constant
to less than five percent across the visible spectrum
[11]. Since both $n$ and $K_0$ are nearly constant for visible
light, $R_S$ is constant with respect to $\lambda$. $R_S$ is a
function of only geometry for inhomogeneous materials.

4.1.3.  Geometrical  Effects 

From the Fresnel equations (section 2), we have
that for fixed $n$ and fixed $K_0$ the specular reflectance is
approximately  constant  over  a  large  range  of  incidence 
angles  (roughly  0°  '  IR  it)'  )  2‘2|.  As  0;  nears  rr/2, 
however, $R_S$ approaches unity for all values of the
complex index of refraction. Consequently, as we approach
glancing incidence, the color of the specularly reflected
light approaches the color of the incident light.

4.2. Colorant Layer Scattering
4.2.1.  Inhomogeneous  Materials 

The color of an inhomogeneous material is primarily due
to the scattering and absorbing characteristics of
colorant layers. These characteristics are described by
the parameters $\alpha(\lambda)$ and $\sigma(\lambda)$. In the limiting case of a
Lambertian surface, reflectance is constant with respect
to wavelength. Thus, a true Lambertian surface will
always appear white. On the other hand, the colorant
particles in real materials tend to selectively absorb
certain wavelengths of light while transmitting others.
This selective absorption is the primary cause of the
variation of $R_B$ with $\lambda$.


4.2.2. Geometrical Effects

We have already observed that the color of a Lambertian
surface is constant with respect to geometry. Although
(2.7) implies that, in general, the color of light
scattered from the body of a material depends on geometry,
this dependence is usually small. Figure 4 is a plot of
$R_B(\theta_i)$ for different values of the scattering albedo
$\omega$. The line with $R_B(\theta_i) = 1$ corresponds to the
conservative scattering case.

Figure 4. $R_B(\theta_i)$ (horizontal axis: $\theta_i$ in degrees)


II. COLOR RECOVERY AND REPRESENTATION

We now begin the second major part of this paper. In the
next four sections, we develop methods for the recovery
and representation of color.

5. Recovering the Function $I(\lambda)$

In this section, we describe a method for recovering the
function $I(\lambda)$ at a point in an image. At each image
point, we measure the outputs $s_i$ of $n$ sensors. Each
sensor has a certain wavelength sensitivity which we
denote by the function $f_i(\lambda)$. Therefore, at each image
point we have the measured values $s_i$ $(0 \le i \le n - 1)$
given by

$$s_i = \int_\lambda f_i(\lambda)I(\lambda)\,d\lambda \qquad (5.1)$$

where $\lambda$ ranges over the entire electromagnetic spectrum.
For our purposes the filters $f_i(\lambda)$ will be nonzero only
for values of $\lambda$ in the visible range (i.e. 400 nm
$< \lambda <$ 700 nm).

Suppose we approximate the function $I(\lambda)$ by a linear
combination of $m$ basis functions $l_j(\lambda)$. If we let $a_j$
be the components of $I(\lambda)$ on this basis, then we have

$$I(\lambda) \approx \sum_{0 \le j \le m-1} a_j l_j(\lambda) \qquad (5.2)$$

Substituting, (5.1) becomes

$$s_i = \int_\lambda f_i(\lambda)\Big(\sum_{0 \le j \le m-1} a_j l_j(\lambda)\Big)\,d\lambda \qquad (5.3)$$

which may be written

$$s_i = \sum_{0 \le j \le m-1} a_j\Big(\int_\lambda f_i(\lambda)l_j(\lambda)\,d\lambda\Big) \qquad (5.4)$$

Let

$$K_{ij} = \int_\lambda f_i(\lambda)l_j(\lambda)\,d\lambda \qquad (5.5)$$

denote the integral in (5.4). Then $K_{ij}$ is a constant
which depends only on the $i$th filter function and the
$j$th basis function. Let $s$ be the $n$-dimensional vector
defined by $s(i) = s_i$. Let $K$ be the $n \times m$ constant matrix
defined by $K(i,j) = K_{ij}$. Let $a$ be the $m$-dimensional
vector defined by $a(i) = a_i$. Then we have the linear
system of $n$ equations

$$s = Ka \qquad (5.6)$$

If we choose our filters $f_i(\lambda)$ and basis functions $l_j(\lambda)$
such that $K$ has maximal rank, then the $n$ sensor outputs
$s_0, s_1, \ldots, s_{n-1}$ uniquely determine $n$ components
$a_0, a_1, \ldots, a_{n-1}$ of $I(\lambda)$. Therefore, by letting $m = n$ in
(5.6) we can recover an estimate of the function $I(\lambda)$ on
the basis $l_j(\lambda)$.
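The recovery procedure of (5.1) through (5.6) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the implementation used in the paper: the three filters, the monomial basis, the test spectrum, and the helper names `integrate` and `solve` are all hypothetical, and wavelength is scaled to $[-1, 1]$.

```python
def integrate(g, lo=-1.0, hi=1.0, steps=2000):
    """Midpoint-rule integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h

def solve(K, s):
    """Solve K a = s by Gaussian elimination with partial pivoting."""
    n = len(s)
    A = [row[:] + [s[i]] for i, row in enumerate(K)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= m * A[col][c]
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (A[i][n] - sum(A[i][j] * a[j] for j in range(i + 1, n))) / A[i][i]
    return a

# Hypothetical nonnegative filters f_i and monomial basis l_j on [-1, 1].
filters = [lambda x: 1.0, lambda x: 1.0 + x, lambda x: x * x]
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]

def spectrum(x):
    """Test signal I(lambda), chosen so that it lies exactly in the basis."""
    return 1.0 + 0.5 * x - 0.25 * x * x

K = [[integrate(lambda x, f=f, l=l: f(x) * l(x)) for l in basis] for f in filters]  # (5.5)
s = [integrate(lambda x, f=f: f(x) * spectrum(x)) for f in filters]                 # (5.1)
a = solve(K, s)                                                                     # (5.6)
```

Since the test spectrum is itself a quadratic, the recovered vector `a` reproduces its coefficients; for a general spectrum the result depends on the choice of filters, which is the subject of the next section.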

6. Selecting the Sensors $f_i(\lambda)$

Given the recovery technique described in section 5, we
examine how a careful choice of the sensors $f_i(\lambda)$ can
improve the quality of our recovered approximation to
$I(\lambda)$. In section 5, we suggested only that the sensors
$f_i(\lambda)$ should be chosen such that $K$ has maximal rank.
In this section, we derive expressions for the $f_i(\lambda)$ which
guarantee we will recover the least error polynomial
approximation to $I(\lambda)$.

Since we are considering a polynomial approximation to
$I(\lambda)$, we choose our $m$ basis functions $l_j(\lambda)$ to span
the space of polynomials of degree less than $m$. We define
the least error polynomial approximation to be the choice
of $a_i$ which minimizes

$$\int_{-1}^{1}\Big[I(\lambda) - \sum_{0 \le i \le m-1} a_i\lambda^i\Big]^2 d\lambda \qquad (6.1)$$

where for convenience we have scaled the visible spectrum
to be the range $-1 \le \lambda \le 1$.

Any polynomial of degree less than $m$ can be written as a
linear combination of the first $m$ Legendre polynomials
[5]. Let $a_i^*$ be the coefficients of the least error
polynomial on the basis of normalized Legendre polynomials
$p_i(\lambda)$ given by

$$p_i(\lambda) = \sqrt{\frac{2i + 1}{2}}\,P_i(\lambda) \qquad (6.2)$$

where $P_i(\lambda)$ is the Legendre polynomial of degree $i$. The
functions $p_i(\lambda)$ are normalized in the sense that

$$\int_{-1}^{1} p_i^2(\lambda)\,d\lambda = 1 \qquad (6.3)$$

The constants $a_i^*$ minimize

$$E = \int_{-1}^{1}\Big[I(\lambda) - \sum_{0 \le i \le m-1} a_i^* p_i(\lambda)\Big]^2 d\lambda \qquad (6.4)$$

Define

$$c_i = \int_{-1}^{1} I(\lambda)p_i(\lambda)\,d\lambda \qquad (6.5)$$

Then using (6.5) and the orthogonality of Legendre
polynomials, (6.4) may be written

$$E = \int_{-1}^{1} I^2(\lambda)\,d\lambda - 2\sum_{0 \le i \le m-1} a_i^* c_i + \sum_{0 \le i \le m-1} (a_i^*)^2 \qquad (6.6)$$

$$E = \int_{-1}^{1} I^2(\lambda)\,d\lambda + \sum_{0 \le i \le m-1} (a_i^* - c_i)^2 - \sum_{0 \le i \le m-1} c_i^2 \qquad (6.7)$$

From (6.7) we see that the minimizing values of $a_i^*$ are
given by $a_i^* = c_i$, $0 \le i \le m - 1$.

For an arbitrary choice of functions $f_i(\lambda)$ such that
$K$ has maximal rank, the procedure of section 5 will not
in general recover the minimal error polynomial given
by the coefficients $a_i^*$. But by choosing the functions
$f_i(\lambda)$ appropriately, we can guarantee our technique will
recover the least error polynomial. One such choice is
$f_i(\lambda) = p_i(\lambda)$, $0 \le i \le m - 1$. For this case, we will
measure $s = c$. Using (5.6) we will recover the least
error approximation $a = a^*$. One way to see this is to
rewrite (5.2) in terms of $p_j(\lambda)$

$$I(\lambda) \approx \sum_{0 \le j \le m-1} a_j l_j(\lambda) = \sum_{0 \le j \le m-1} a_j p_j(\lambda) \qquad (6.8)$$

For this expansion of $I(\lambda)$ in $p_j(\lambda)$ the orthogonality of
Legendre polynomials and (5.5) give us

$$K_{ij} = \begin{cases} 1 & \text{if } i = j; \\ 0 & \text{if } i \ne j. \end{cases} \qquad (6.9)$$

so that $K$ is the identity matrix. From (5.6) $a = s$ and
the recovered polynomial will be the least error
polynomial.

Since for physical sensors we require $f_i(\lambda) \ge 0$ for
$-1 \le \lambda \le 1$ we cannot directly use $f_i(\lambda) = p_i(\lambda)$ since
$p_i(\lambda)$ is negative for some values of $i$ and $\lambda$. Below we
suggest a method for arriving at the least error
polynomial approximation using an equivalent number of
physically realizable sensors. We are required to compute
the values

$$c_i = \int_{-1}^{1} I(\lambda)p_i(\lambda)\,d\lambda = \sqrt{\frac{2i + 1}{2}}\int_{-1}^{1} I(\lambda)P_i(\lambda)\,d\lambda \qquad (6.10)$$

For $c_0$ we have

$$c_0 = \frac{1}{\sqrt{2}}\int_{-1}^{1} I(\lambda)\,d\lambda \qquad (6.11)$$

which is realizable by using $f_0(\lambda) = 1$. Since $P_i(\lambda) \ge -1$
for $-1 \le \lambda \le 1$ we can compute

$$\int_{-1}^{1} I(\lambda)P_i(\lambda)\,d\lambda = \int_{-1}^{1} I(\lambda)\,[P_i(\lambda) + 1]\,d\lambda - \sqrt{2}\,c_0 \qquad (6.12)$$

since $P_i(\lambda) + 1 \ge 0$ for $-1 \le \lambda \le 1$ and $c_0$ is known
from (6.11).
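The construction with nonnegative sensors can be checked numerically. The sketch below compares the coefficients $c_i$ obtained directly from the normalized Legendre polynomials with those obtained from the sensors $1$ and $P_i(\lambda) + 1$. It is a minimal illustration: the function names, the midpoint quadrature, and the test spectrum are assumptions of this example rather than details from the paper.

```python
import math

def legendre(i, x):
    """Legendre polynomial P_i(x) from the three-term recurrence."""
    p_prev, p = 1.0, x
    if i == 0:
        return p_prev
    for k in range(1, i):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def integrate(g, steps=4000):
    """Midpoint-rule integral of g over the scaled visible range [-1, 1]."""
    h = 2.0 / steps
    return sum(g(-1.0 + (k + 0.5) * h) for k in range(steps)) * h

def coefficients_direct(signal, m):
    """c_i of (6.5), using the sensors p_i, which are not physically realizable."""
    return [math.sqrt((2 * i + 1) / 2.0) * integrate(lambda x: signal(x) * legendre(i, x))
            for i in range(m)]

def coefficients_physical(signal, m):
    """The same c_i, using only the nonnegative sensors 1 and P_i + 1."""
    total = integrate(signal)                  # output of sensor f_0 = 1, equal to sqrt(2) c_0
    c = [total / math.sqrt(2.0)]
    for i in range(1, m):
        shifted = integrate(lambda x: signal(x) * (legendre(i, x) + 1.0))
        c.append(math.sqrt((2 * i + 1) / 2.0) * (shifted - total))
    return c

def signal(x):
    """Illustrative nonnegative spectrum on [-1, 1]."""
    return 1.0 + 0.5 * x - 0.25 * x * x

direct = coefficients_direct(signal, 4)
physical = coefficients_physical(signal, 4)
```

Both routes give the same coefficients up to rounding, so the least error polynomial can be recovered with sensors whose sensitivities are never negative.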

7. A Metric Space for Colors

In this section we develop a metric space for physical
colors. There are two important reasons why a vision
system should possess a color metric. First, a color
metric allows a vision system to determine how closely a
perceived color matches a known color. Second, a color
metric is necessary if a system hopes to locate color
discontinuities in an image. It is well known that the
human visual machinery includes a color metric [13]. Our
development, however, is motivated by the techniques
of functional analysis. No attempt is made to relate our
color metric to the color metric implicit in human
vision. Nevertheless, it is likely that both metrics serve
their respective vision systems in similar ways.

We begin by distinguishing the total power of the
signal $I(\lambda)$ from its color. The total power of $I(\lambda)$ is
given by

$$E = \int_{-1}^{1} I(\lambda)\,d\lambda \qquad (7.1)$$

where the interval [-1,1] represents the visible spectrum.
Total power is a physical analog of the psychological
concept of brightness. We define the physical color of
$I(\lambda)$ by the function $\hat{I}(\lambda)$ having fixed total power

$$\hat{I}(\lambda) = \frac{I(\lambda)}{\int_{-1}^{1} I(\lambda)\,d\lambda} \qquad (7.2)$$

Physical color $\hat{I}(\lambda)$ is related to the psychological
concept of hue. The space of physical colors is the space
of all continuous nonnegative functions $\hat{I}(\lambda)$ on [-1,1]
having unit total power.

Given any two physical colors $\hat{I}_1(\lambda)$ and $\hat{I}_2(\lambda)$ we
define the distance from $\hat{I}_1(\lambda)$ to $\hat{I}_2(\lambda)$ by

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\int_{-1}^{1}\big[\hat{I}_1(\lambda) - \hat{I}_2(\lambda)\big]^2 d\lambda} \qquad (7.3)$$

We note that (7.3) satisfies the properties of a distance
function, namely

1. $d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) > 0$ if $\hat{I}_1(\lambda) \ne \hat{I}_2(\lambda)$

2. $d(\hat{I}_1(\lambda), \hat{I}_1(\lambda)) = 0$

3. $d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = d(\hat{I}_2(\lambda), \hat{I}_1(\lambda))$

4. $d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) \le d(\hat{I}_1(\lambda), \hat{I}_3(\lambda)) + d(\hat{I}_3(\lambda), \hat{I}_2(\lambda))$

(1), (2), and (3) easily follow from (7.3) and (4) follows
from the Cauchy-Schwarz inequality. Therefore, $d$ is
a metric and the space of physical colors under $d$ is a
metric space in the topological sense.

For reasons relevant to our application, we use the
Euclidean distance of (7.3) rather than the commonly
used maximum distance defined by

$$d_{max}(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \max_\lambda\,\big|\hat{I}_1(\lambda) - \hat{I}_2(\lambda)\big| \qquad (7.4)$$

The distance of (7.3) provides more stability than (7.4)
for measured functions $\hat{I}_1(\lambda)$ and $\hat{I}_2(\lambda)$. While
inaccuracies over a small range of $\lambda$ can cause large
deviations of $d_{max}$, the distance $d$ of (7.3) depends on the
distance between the functions integrated over the entire
spectrum. Therefore, the distance $d$ will usually give a
more reliable characterization of the distance between
two measured physical colors than the distance $d_{max}$.

8. The Normalized Legendre Polynomial Representation

In section 6, we showed how to select sensors to recover
the least error polynomial approximation to the
function $I(\lambda)$. The derivation itself suggested that the
most convenient representation for this approximation is
the basis of normalized Legendre polynomials given by
(6.2). Our implementation uses this basis to represent
color. In this section, we show how the basis of
normalized Legendre polynomials not only facilitates the
recovery of the least error approximation to $I(\lambda)$, but
also how this basis simplifies many of the computations
performed in color space.

8.1. Computing Physical Color $\hat{I}(\lambda)$

Using the normalized Legendre polynomial representation,
the technique of section 5 recovers an approximation
to $I(\lambda)$ of the form

$$I(\lambda) = \sum_{0 \le i \le m-1} a_i p_i(\lambda) \qquad (8.1)$$

The total power of $I(\lambda)$ is given by

$$E = \int_{-1}^{1} I(\lambda)\,d\lambda = \int_{-1}^{1}\Big[\sum_{0 \le i \le m-1} a_i p_i(\lambda)\Big]\,d\lambda = \sqrt{2}\,a_0 \qquad (8.2)$$

where the last step follows from the orthogonality of the
functions $p_i(\lambda)$. Therefore, the total power of $I(\lambda)$ may
be determined by only considering the first coefficient
in the normalized Legendre polynomial representation.
From the total power of $I(\lambda)$, we can determine the
physical color $\hat{I}(\lambda)$ by

$$\hat{I}(\lambda) = \frac{I(\lambda)}{\sqrt{2}\,a_0} = \sum_{0 \le i \le m-1} \hat{a}_i p_i(\lambda) \qquad (8.3)$$

where

$$\hat{a}_i = \frac{a_i}{\sqrt{2}\,a_0} \qquad (8.4)$$
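The normalization to unit total power is a one-line computation on the coefficient vector. The sketch below illustrates it, assuming each coefficient is divided by the total power $\sqrt{2}\,a_0$; the function names and the sample coefficients are hypothetical.

```python
import math

def total_power(a):
    """Total power as in (8.2): only the first Legendre coefficient contributes."""
    return math.sqrt(2.0) * a[0]

def physical_color_coefficients(a):
    """Coefficients of the unit-power signal, a_i / (sqrt(2) a_0); requires a_0 > 0."""
    e = total_power(a)
    return [ai / e for ai in a]

a = [1.2, 0.4, -0.1]                    # illustrative coefficients on p_0, p_1, p_2
a_hat = physical_color_coefficients(a)
brighter = physical_color_coefficients([2.0 * ai for ai in a])
```

Scaling a signal changes its total power but not its physical color, and every physical color has first coefficient $1/\sqrt{2}$.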


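8.2. Metric Space Properties

For two signals $I_1(\lambda)$ and $I_2(\lambda)$ given by

$$I_1(\lambda) = \sum_{0 \le i \le m-1} r_i p_i(\lambda), \qquad I_2(\lambda) = \sum_{0 \le i \le m-1} s_i p_i(\lambda) \qquad (8.5)$$

the physical colors are

$$\hat{I}_1(\lambda) = \sum_{0 \le i \le m-1} \hat{r}_i p_i(\lambda), \qquad \hat{I}_2(\lambda) = \sum_{0 \le i \le m-1} \hat{s}_i p_i(\lambda) \qquad (8.6)$$

where the $\hat{r}_i$ and $\hat{s}_i$ are computed as in (8.4).

The color space distance between $\hat{I}_1(\lambda)$ and $\hat{I}_2(\lambda)$
is given by

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\int_{-1}^{1}\Big[\sum_{0 \le i \le m-1} (\hat{r}_i - \hat{s}_i)\,p_i(\lambda)\Big]^2 d\lambda} \qquad (8.7)$$

which simplifies because of orthogonality to

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\sum_{0 \le i \le m-1} (\hat{r}_i - \hat{s}_i)^2} \qquad (8.8)$$

8.3. Color Space as a Subset of $R^{m-1}$

From (8.3), our representation for physical colors is

$$\hat{I}(\lambda) = \sum_{0 \le i \le m-1} \hat{a}_i p_i(\lambda) \qquad (8.9)$$

From (8.4) we see that all physical colors have
$\hat{a}_0 = 1/\sqrt{2}$. We can write

$$\hat{I}(\lambda) = \frac{1}{2} + \sum_{1 \le i \le m-1} \hat{a}_i p_i(\lambda) \qquad (8.10)$$

We can consider the physical color $\hat{I}(\lambda)$ to be the point
in $R^{m-1}$ with coordinates $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{m-1}$. Since we
require $\hat{I}(\lambda) \ge 0$ on $-1 \le \lambda \le 1$, physical colors form a
proper subset of $R^{m-1}$. To be precise, physical colors
are those points of $R^{m-1}$ for which

$$\sum_{1 \le i \le m-1} \hat{a}_i p_i(\lambda) \ge -\frac{1}{2}, \qquad -1 \le \lambda \le 1 \qquad (8.11)$$

Let $C^{m-1}$ be the set of points in $R^{m-1}$ which are
physical colors. Points in $C^{m-1}$ will be referred to as

$$\hat{a} = (\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{m-1}) \qquad (8.12)$$

where the coordinates $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{m-1}$ have the same
significance as in (8.10).

An important consequence of viewing colors as
points in $C^{m-1}$ is that the usual Euclidean metric in
$R^{m-1}$ is equivalent to the metric we defined for physical
colors in section 7. This is seen by examining (8.8)
and recalling that $\hat{r}_0 = \hat{s}_0$. Thus, our representation in
terms of normalized Legendre polynomials allows
intuition about the familiar distance in $R^{m-1}$ to be applied
to distance in color space.

Another useful property of the color space $C^{m-1}$
is that the color corresponding to an arbitrary additive
combination of two functions $I_1(\lambda)$ and $I_2(\lambda)$ will lie on
the line in $C^{m-1}$ which connects the physical colors $C_1$
and $C_2$ corresponding to $I_1(\lambda)$ and $I_2(\lambda)$. Let $I_1(\lambda)$ and
$I_2(\lambda)$ be the functions given by

$$I_1(\lambda) = \sum_{0 \le i \le m-1} r_i p_i(\lambda), \qquad I_2(\lambda) = \sum_{0 \le i \le m-1} s_i p_i(\lambda) \qquad (8.13)$$

The corresponding colors in $C^{m-1}$ are

$$C_1 = (\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_{m-1}), \qquad C_2 = (\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{m-1}) \qquad (8.14)$$

Consider the linear combination $I_3(\lambda)$ given by

$$I_3(\lambda) = K_1 I_1(\lambda) + K_2 I_2(\lambda), \qquad K_1, K_2 \ge 0 \qquad (8.15)$$

Substituting gives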

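$$I_3(\lambda) = \sum_{0 \le i \le m-1} (K_1 r_i + K_2 s_i)\,p_i(\lambda) \qquad (8.16)$$

The color of $I_3(\lambda)$ in $C^{m-1}$ is $C_3$ where

$$C_3 = \left(\frac{K_1 r_1 + K_2 s_1}{\sqrt{2}\,(K_1 r_0 + K_2 s_0)},\ \frac{K_1 r_2 + K_2 s_2}{\sqrt{2}\,(K_1 r_0 + K_2 s_0)},\ \ldots,\ \frac{K_1 r_{m-1} + K_2 s_{m-1}}{\sqrt{2}\,(K_1 r_0 + K_2 s_0)}\right) \qquad (8.17)$$

$C_3$ can be written as

$$C_3 = \big(\hat{r}_1 + u(\hat{s}_1 - \hat{r}_1),\ \hat{r}_2 + u(\hat{s}_2 - \hat{r}_2),\ \ldots,\ \hat{r}_{m-1} + u(\hat{s}_{m-1} - \hat{r}_{m-1})\big) \qquad (8.18)$$

where

$$u = \frac{K_2 s_0}{K_1 r_0 + K_2 s_0} \qquad (8.19)$$

Therefore $C_3$ lies on the line in $C^{m-1}$ connecting $C_1$ and
$C_2$.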
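Both color space properties can be illustrated with a short computation. The sketch below assumes color coordinates are obtained by dividing the coefficients by $\sqrt{2}$ times the first coefficient, as in the preceding derivation. The function names, the sample coefficient vectors, and the mixing weights are hypothetical.

```python
import math

def physical_color(a):
    """Point (a_1, ..., a_{m-1}) / (sqrt(2) a_0) in color space; requires a_0 > 0."""
    scale = math.sqrt(2.0) * a[0]
    return [ai / scale for ai in a[1:]]

def color_distance(r, s):
    """Euclidean distance, as in (8.8), between the physical colors of two signals."""
    return math.dist(physical_color(r), physical_color(s))

def mix(r, s, k1, k2):
    """Physical color of K1*I1 + K2*I2 and the line parameter u of (8.19)."""
    t = [k1 * ri + k2 * si for ri, si in zip(r, s)]    # coefficients of I3, as in (8.16)
    u = k2 * s[0] / (k1 * r[0] + k2 * s[0])            # (8.19)
    return physical_color(t), u

r = [1.0, 0.5, -0.25]       # illustrative Legendre coefficients of I1
s = [2.0, -0.4, 0.6]        # illustrative Legendre coefficients of I2
c1, c2 = physical_color(r), physical_color(s)
c3, u = mix(r, s, k1=0.7, k2=1.5)
on_line = [p + u * (q - p) for p, q in zip(c1, c2)]    # right-hand side of (8.18)
```

The mixed color `c3` coincides with `on_line`, and `u` lies between 0 and 1, so the mixture falls on the segment joining the two component colors.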


III.  COLOR  ALGORITHMS 


We  now  move  to  the  third  major  part  of  this  pa¬ 
per.  In  the  next  three  sections,  we  derive  algorithms  to 

extract  invariant  properties  of  objects  from  images.  In 
section  12,  we  present  experimental  results. 


9.  Physical  Segmentation 

Given  an  image,  it  is  useful  to  locate  the  places  at 
which  image  irradiance  is  discontinuous.  These  discon¬ 
tinuities  are  important  because  they  usually  correspond 
to  significant  physical  events  in  a  scene.  Many  tech¬ 
niques  have  been  developed  to  detect  image  irradiance 
discontinuities.  A  summary  of  some  of  the  issues  in¬ 
volved  in  edge  detection,  as  well  as  an  elegant  approach 
to the problem, is given in [17].

While locating irradiance discontinuities has attracted
much attention in computer vision, less work has been done
on the important problem of identifying the underlying
causes of these discontinuities. The most common physical
causes of image irradiance discontinuities are
illumination discontinuities, surface orientation
discontinuities, specular discontinuities, pigment density
discontinuities, and material discontinuities. We will
refer to the problem of classifying image irradiance
discontinuities according to their physical cause as the
physical segmentation problem.

Some progress has been made on classifying certain
kinds of irradiance discontinuities. Herskovits and
Binford [9] and Horn [11] discuss ways to classify edges
in images of polyhedra. Binford [3] examines ways to
distinguish discontinuities due to illumination, geometry,
and reflectance. Witkin [24] attempts to classify
an edge by using intensity correlation across the edge.
In the context of computing intrinsic images, Barrow
and Tenenbaum [1] describe methods for classifying
certain kinds of edges in a limited domain.

Some progress has also been made in using color
to classify edges. Rubin and Richards [20] have proposed
a method for finding material discontinuities, but
their assumptions and physical models appear limiting.
Gershon, Jepson, and Tsotsos [6] discuss a more general
method for distinguishing material changes from shadow
boundaries. Shafer [21] has developed a method to separate
diffuse scattering from specular reflection using
color. The scope of this method is limited to inhomogeneous
materials for which the color of the diffuse reflection
differs from that of the specular reflection.

The complete solution to the physical segmentation
problem is the subject of another report. Here we
restrict  ourselves  to  discussing  distinctive  properties  of 
the  image  irradiance  discontinuities  which  occur  where 
the  specular  component  of  the  reflected  light  becomes 
significant. 

As discussed in section 3, the most conspicuous feature
of specular reflection is that it is invariably associated
with image irradiance values which are much larger
than those in neighboring image regions. Moreover, the
expected magnitude of the difference in irradiance is
directly related to properties of the imaging system which
are often known. Another quasi-invariant property of
specular  features  is  that  they  are  typically  small,  es¬ 
pecially  for  curved  surfaces.  Thus  unless  a  specularity 
occurs  at  the  edge  of  a  surface,  it  will  be  surrounded  on 
all  sides  by  diffusely  reflected  light  of  approximately  the 
same  color  and  power.  Finally,  we  observe  that  most 
specular  discontinuities  occur  in  places  where  the  re¬ 
flecting  surface  is  continuous.  It  has  been  shown  that 
image  irradiance  for  diffusely  reflected  light  can  be  used 
to  compute  local  descriptions  of  a  surface  to  at  least 
second  order  |I0).  It,  has  also  been  shown  that  image 
irradiance  for  specularly  reflected  light  can  be  used  to 
compute  similar  local  surface  descriptions  [7],  At  spec¬ 
ular  image  irradiance  discout.iiiiiil.ies,  we  usually  expect 
the  first  few  derivatives  of  the  surface  to  he  continuous. 
Thus  by  comparing  the  local  surface  descriptions  gener¬ 
ated  by  [10]  and  [7]  we  have  another  way  to  verify  the 
presence  of  a  specular  image  irradiance  discontinuity. 
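The first two cues above (a specular feature is much brighter than its
surroundings and is typically small) suggest a simple screening test. The
following is only a minimal sketch under stated assumptions: the function
name `specular_candidates`, the restriction to a single scanline, the use of
the median as an estimate of the surrounding diffuse level, and the ratio
threshold are all illustrative choices that do not come from the paper.

```python
from statistics import median

def specular_candidates(scanline, ratio=3.0):
    """Return indices whose irradiance exceeds `ratio` times the scanline median.

    Assumption: specular features are small, so the median of the scanline
    approximates the diffusely reflected level around them.
    """
    baseline = median(scanline)
    if baseline <= 0:
        return []
    return [i for i, value in enumerate(scanline) if value > ratio * baseline]

# Hypothetical irradiance samples with a short, bright run in the middle.
samples = [10, 11, 12, 90, 95, 12, 11, 10]
candidates = specular_candidates(samples)
```

In practice the ratio would be chosen from the known properties of the
imaging system, as the text notes, rather than fixed by hand.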




10.  Generic  Classification  of  Materials 

In  this  section  and  the  next,  we  describe  procedures 
for  using  color  to  extract  distinctive  invariant  proper¬ 
ties  of  objects  from  images.  In  this  section,  we  show 
how color can be used to recover a symbolic description
of the material an object is made of. In section 11,
we  describe  a  method  for  recovering  an  object’s  surface 
spectral  reflectance. 

Classifying  objects  according  to  material  is  impor¬ 
tant  because  material  is  an  invariant  property  of  an  ob¬ 
ject.  It  is  very  valuable,  for  example,  to  be  able  to  decide 
that  an  object  is  metal  rather  than  plastic  or  painted 
wood  rather  than  dyed  cloth. 

We  showed  in  section  4  that  an  important  property 
of  a  material  is  whether  it  is  optically  homogeneous  or 
optically  inhomogeneous.  By  examining  the  physics  of 
reflection described in section 2, we have derived a procedure
for classifying a material as either homogeneous
or  inhomogeneous. 

Homogeneous  materials  reflect  light  only  from  the 
surface.  The  color  of  this  reflected  light  is  determined 
by the Fresnel equations from the complex index of refraction
of the material as a function of wavelength. For
a single color of illumination $L(\lambda)$, the color of light
reflected  from  a  homogeneous  material  will  be  nearly 
constant  with  only  slight  variations  due  to  changing  ge¬ 
ometry. 

Inhomogeneous  materials  both  reflect  light  from  the 
surface  and  scatter  light  from  the  body  of  the  material. 
The  color  of  the  light  reflected  from  the  surface  is  deter¬ 
mined  by  the  index  of  refraction  of  the  vehicle.  The  color 
of  the  light  scattered  from  the  body  is  determined  by  the 
selective  absorption  properties  of  the  colorant  particles 
embedded  in  the  vehicle.  In  general,  the  color  of  the 
surface  reflected  light  will  be  different  from  the  color  of 
the  body  scattered  light.  Therefore,  given  a  single  color 
of illumination $L(\lambda)$, there will be two distinct colors of
light  reflected  from  an  inhomogeneous  material. 

Using the technique discussed in section 9, we are
able to find image irradiance discontinuities corresponding
to the places where the power of the specularly reflected
light becomes significant. Using our metric for
color space (section 7), we can examine whether these
image irradiance discontinuities coincide with discontinuities
in color space. If a color discontinuity is not
detected, there is strong evidence for a homogeneous
material. If we do detect a color discontinuity, then the
material is probably optically inhomogeneous.
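The decision rule just described can be sketched as follows. This is a
minimal illustration, not the authors' implementation: the function names,
the use of sum-normalized sensor vectors with a Euclidean distance as a
stand-in for the color metric of section 7, and the threshold value are all
assumptions.

```python
import math

def normalize(color):
    """Scale a sensor-response vector to unit sum so only relative color remains."""
    total = sum(color)
    if total <= 0:
        raise ValueError("color must have positive total power")
    return [c / total for c in color]

def color_distance(a, b):
    """Euclidean distance between normalized colors (stand-in for the section 7 metric)."""
    na, nb = normalize(a), normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))

def classify_material(specular_color, body_color, threshold=0.05):
    """Compare color on the two sides of a specular irradiance discontinuity.

    A color change across the boundary suggests an inhomogeneous material;
    no color change suggests a homogeneous one.
    """
    if color_distance(specular_color, body_color) > threshold:
        return "inhomogeneous"
    return "homogeneous"

# Hypothetical four-sensor measurements on either side of a highlight boundary.
plastic = classify_material([0.9, 0.9, 0.9, 0.9], [0.1, 0.2, 0.6, 0.3])
metal = classify_material([0.8, 0.6, 0.4, 0.2], [0.4, 0.3, 0.2, 0.1])
```

Normalizing before comparing means that the large difference in power across
the boundary does not by itself register as a color difference.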

We see that in most situations, it is possible
to distinguish homogeneous materials from inhomogeneous
materials using techniques in color space. Once
the homogeneous-inhomogeneous classification has been
made, it is possible to use color to distinguish different
homogeneous materials, e.g. aluminum and copper, and
to distinguish different inhomogeneous materials, e.g.
white plastic and red plastic. Our method to achieve
this additional level of classification is based on recovering
surface spectral reflectance. We describe our method
in the next section.

11. Recovering Surface Spectral Reflectance

Another invariant property of an object which is
valuable for recognition is the object’s surface spectral
reflectance. Unfortunately, surface spectral reflectance
is not determined by the spectral distribution of the light
reflected by a surface. It is this reflected light which is
directly sensed by a vision system. The light which is reflected
by a surface is the product of the spectral distribution
of the incident light and the spectral reflectance
of  the  surface.  To  recover  the  spectral  reflectance  of 
a  surface,  some  mechanism  must  be  available  to  factor 
out  the  effects  of  the  incident  light.  Many  experiments 
have  shown  that  the  human  vision  system  is  capable 
of  making  this  computation.  This  ability  of  humans  to 
see  objects  as  having  a  constant  color  despite  varying 
illumination  conditions  is  called  color  constancy. 

Many theories have been advanced to explain color
constancy [2]. Our approach is based on the physics of
reflection. Another class of approaches views the task
as an underconstrained mathematical problem [16], [25].
These approaches identify the assumptions about incident
illumination and surface spectral reflectance which
are required to make color constancy possible from a
purely computational point of view. In other work, psychologists
have suggested that the eye selectively adapts
to the color of the ambient light. Experiments have
shown that this selective adaptation might be partly responsible
for human color constancy [8].

Our  method  for  recovering  surface  spectral  re¬ 
flectance  is  applicable  to  instances  of  surfaces  which  are 
illuminated  by  the  same  spectral  distribution  of  light  as 
that  which  illuminates  an  inhomogeneous  object  in  the 
scene. This condition is quite general, and is almost always
satisfied in real situations where a small number
of  different  illuminants  contribute  to  the  image  forming 
process. 

We recall from section 4 that for inhomogeneous
materials $K_0(\lambda) \approx 0$ and $n(\lambda)$ is nearly constant across
the visible spectrum. From the Fresnel equations, the
specular reflectance of an inhomogeneous material is
a constant function of wavelength for fixed geometry.
Moreover, the Fresnel equations tell us that the specular
reflectance for fixed wavelength is constant with respect
to geometry for almost all incidence angles. Therefore,
for inhomogeneous objects the Fresnel component of the
reflectance can be regarded as constant with respect to
both geometry and wavelength.




Given the physical segmentation technique (section
9) and our technique for classifying inhomogeneous materials
(section 10), we can recover surface spectral reflectance
up to a multiplicative constant. We use the
previously described procedures to locate a specular-body
reflection boundary on an inhomogeneous object.
From (2.12) and (2.13), the measured function $f(\lambda)$ is
given by

$$f(\lambda) = [R_S(\lambda) + R_B(\lambda)]\,L(\lambda) \qquad (11.1)$$

On the body reflection side of the boundary, $R_S(\lambda) = 0$,
giving

$$f'(\lambda) = R_B(\lambda)\,L(\lambda) \qquad (11.2)$$

where $R_B(\lambda)$ can be considered to be the same as in
(11.1) since $R_B(\lambda)$ changes slowly with respect to geometry.
We can solve for $R_S(\lambda)L(\lambda)$ using

$$R_S(\lambda)\,L(\lambda) = f(\lambda) - f'(\lambda) \qquad (11.3)$$

But $R_S(\lambda)$ is constant with respect to $\lambda$ and geometry.
Let $R_S(\lambda) = k$. Therefore, using (11.3) we can compute
$L(\lambda)$ up to the constant $k$ by

$$k\,L(\lambda) = f(\lambda) - f'(\lambda) \qquad (11.4)$$


From (11.2), $R_B(\lambda)$ can now be computed up to the
constant $1/k$ by

$$\frac{R_B(\lambda)}{k} = \frac{f'(\lambda)}{k\,L(\lambda)} \qquad (11.5)$$

For a homogeneous material in the scope of $L(\lambda)$, we
have

$$f(\lambda) = R_S(\lambda)\,L(\lambda) \qquad (11.6)$$

because $R_B(\lambda) = 0$. Thus we can compute $R_S(\lambda)$ to
within $1/k$ by

$$\frac{R_S(\lambda)}{k} = \frac{f(\lambda)}{k\,L(\lambda)} \qquad (11.7)$$

Therefore, using (11.5) and (11.7) we can compute the
surface spectral reflectance for any surface illuminated
by $L(\lambda)$ up to a multiplicative constant. Examples of
the performance of this method on real images will be
given in section 12.
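The computation in (11.4), (11.5), and (11.7) amounts to one subtraction and
one division per spectral sample. The following is a minimal sketch under
stated assumptions: the measured functions are taken to be lists sampled at
the same wavelengths, and the function names and the synthetic values of
$L(\lambda)$, $R_B(\lambda)$, and $k$ are invented purely for illustration.

```python
def recover_scaled_illuminant(f_specular, f_body):
    """k*L(lambda) = f(lambda) - f'(lambda), following (11.4)."""
    return [fs - fb for fs, fb in zip(f_specular, f_body)]

def recover_scaled_reflectance(f_measured, k_times_L, eps=1e-12):
    """R(lambda)/k = f(lambda) / (k*L(lambda)), following (11.5) and (11.7)."""
    result = []
    for f, kl in zip(f_measured, k_times_L):
        if abs(kl) < eps:
            raise ValueError("illuminant estimate is zero at some wavelength")
        result.append(f / kl)
    return result

# Synthetic check with made-up spectra sampled at three wavelengths.
L = [1.0, 2.0, 4.0]          # illuminant (unknown to the method)
R_B = [0.1, 0.5, 0.2]        # body reflectance (unknown to the method)
k = 0.04                     # constant specular reflectance

f_body = [rb * l for rb, l in zip(R_B, L)]            # f'(lambda), as in (11.2)
f_specular = [(k + rb) * l for rb, l in zip(R_B, L)]  # f(lambda), as in (11.1)

kL = recover_scaled_illuminant(f_specular, f_body)
R_B_over_k = recover_scaled_reflectance(f_body, kL)
```

Multiplying `R_B_over_k` by the true `k` reproduces `R_B`, which illustrates
that the recovered reflectance is correct up to the single unknown constant.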

It should not be surprising that there is a fundamental
ambiguity in the computed spectral reflectance
corresponding to the constant $k$. From (4.1)

$$f(\lambda) = L(\lambda)\,R(\lambda) \qquad (11.8)$$

We see that an arbitrary constant $c$ can be introduced
into both $L(\lambda)$ and $R(\lambda)$ such that the resultant $f(\lambda)$
will be indistinguishable from $f(\lambda)$ in (11.8):

$$f(\lambda) = \left(\frac{L(\lambda)}{c}\right)\bigl(c\,R(\lambda)\bigr) \qquad (11.9)$$

Thus without using additional assumptions, we cannot
expect to determine $R(\lambda)$ better than to within a multiplicative
constant. We note that physical color (section
7) is independent of these scaling constants.

12.  Experimental  Results 

A simple laboratory setup has been used to test our
color methods. We digitize color images using a solid-state
camera and four Wratten gelatin filters. The
camera is equipped with an infrared cutoff filter. Figure
5 illustrates the spectral characteristics of our sensors.


Figure  5.  The  camera  and  filter  transmission  functions 

The large amplitude curve indicates the spectral sensitivity
of our camera. The four smaller curves are the
spectral transmission functions of our filters multiplied
by the spectral sensitivity of the camera. These four
smaller curves correspond to the functions $f_0(\lambda)$, $f_1(\lambda)$,
$f_2(\lambda)$, and $f_3(\lambda)$ of section 5.

Two different light sources have been used for experiments.
One is a tungsten halogen lamp of color temperature
3400 °K which is typical of indoor illumination.
The other lamp has color temperature 4800 °K and is intended
to simulate daylight.




Several  simple  objects  have  been  used  to  test  our 
color  methods.  Our  objects  include  plastic  cups,  metal 
cylinders,  and  painted  wooden  blocks. 

In Figures 6-7, we show the performance of our
algorithms on color images of plastic cups illuminated
by the 4800 °K color temperature lamp. Since the image
irradiance corresponding to the specular reflection is
markedly larger than the image irradiance corresponding
to the diffuse reflection, the physical segmentation
process of section 9 easily locates the specular-diffuse
boundaries. The method of section 5 is then used to recover
the function $f(\lambda)$ for both the specularly reflected
light and the diffusely reflected light. Figure 6(a) is
$f(\lambda)$ for the Fresnel reflection from a blue cup. It agrees
well with the actual color of the light source. Figure
6(b) is $f(\lambda)$ for the diffuse reflection from the blue cup.
From the large color difference between Figure 6(a) and
Figure 6(b), the algorithm of section 10 is easily able
to identify the plastic cup as being made of an inhomogeneous
material. We remark that it would be very
difficult to infer that the cup is blue by simply inspecting
the color of the reflected light $f(\lambda)$ in Figure 6(b).
In fact, the largest amount of power is in the red part of
the visible spectrum (near 700 nm). To determine the
color of the cup (as distinct from the color of the light
reflected from the cup), we must compute the surface
spectral reflectance. This is done using the method of
section 11. Figure 6(c) shows the spectral reflectance
computed for the blue cup. From Figure 6(c), we can
tell that the cup is blue. Figures 7(a), 7(b), and 7(c)
show the performance of our algorithms on a color image
of a red plastic cup. We see that our algorithms are
able to correctly determine the object’s material (from
Figures 7(a) and 7(b)) and spectral reflectance (Figure
7(c)).


Figure  6(a).  Fresnel  Reflection  from  blue  cup 


Figure  6(b).  Diffuse  Reflection  from  blue  cup 


Figure  6(c).  Computed  Reflectance  for  blue  cup 


Figure 7(a). Fresnel Reflection from red cup




Figure 7(b). Diffuse Reflection from red cup


Figure 7(c). Computed Reflectance for red cup

13. Conclusions

In this paper, we have analyzed the importance of
color understanding in a general vision system. Starting
from general physical models, we have shown that color
can play an important role in both the classification of
materials and in the recognition of objects. An appropriate
use of color, therefore, can significantly extend
the capabilities of a vision system.

We have developed two color algorithms which extract
invariant properties of objects from images. The
first algorithm classifies objects according to material.
The second algorithm recovers an object’s surface spectral
reflectance. Both algorithms have been implemented
and consistently produce correct results on real
images.

Our color algorithms are part of a general color understanding
system. In constructing this system, we
have addressed the problems of color recovery and color
representation. The input to the system is images captured
using sensors with different spectral responses. A
robust technique has been developed to recover physical
color from these images. We represent physical color by
points in a metric space. As we have shown, the properties
of our representation greatly facilitate using color
to achieve the goals of a general vision system.

Acknowledgements

This work has been supported by an NSF graduate
fellowship, AFOSR contract F33615-85-C-5106, and
ARPA contract N000-39-84-C-0211.


References 

[1] Barrow, H.G., and Tenenbaum, J.M., “Recovering
Intrinsic Scene Characteristics from Images”, in A.R.
Hanson and E.M. Riseman (Eds.), Computer Vision
Systems, Academic Press, New York, 1978, 3-26.

[2]  Beck,  J.,  Surface  Color  Perception  ,  Cornell  U.  Press, 
Ithaca,  N.Y.,  1972. 

[3]  Binford,  T.O.,  “Inferring  Surfaces  from  Images,”  Ar¬ 
tificial  Intelligence  17,  1981,  205-245. 

[4] Born, M. and Wolf, E., Principles of Optics, Pergamon
Press, New York, 1959.

[5] Courant, R. and Hilbert, D., Methods of Mathematical
Physics, Volume 1, Wiley & Sons, New York, 1953.

[6] Gershon, R. and Jepson, A. and Tsotsos, J., “Ambient
Illumination and the Determination of Material
Changes”, Journal of the Optical Society of America A,
3(10), 1986, 1700-1707.

[7] Healey, G. and Binford, T.O., “Local Shape from
Specularity”, Proceedings of ARPA Image Understanding
Workshop, USC, 1987.

[8] Helson, H., Adaptation-Level Theory, Harper and
Row, New York, 1964.

[9]  Herskovits,  A.  and  Binford,  T.O.,  “On  Boundary  De¬ 
tection”,  MIT  AI  Memo  183,  1970. 

[10]  Horn,  B.K.P.,  “Obtaining  Shape  from  Shading  In¬ 
formation,”  in  The  Psychology  of  Computer  Vision,  (P. 
Winston,  Ed.),  McGraw-Hill,  New  York,  1975. 

[11]  Horn,  B.K.P.,  “Understanding  Image  Intensities," 
Artificial  Intelligence  8,  1977,  201-231. 

[12]  Horn,  B.K.P.  and  Sjoberg,  R.,  “Calculating  the  Re¬ 
flectance  Map,”  Applied  Optics,  Vol.  18,  No.  11,  June 
1979,  1770-1779. 




[13]  Judd,  D.  and  Wyszecki,  G.,  Color  in  Business,  Sci¬ 
ence,  and  Industry,  Wiley  &  Sons,  New  York,  1975. 

[14]  Kanthack,  R.,  Tables  of  Refractive  Indices,  Vol.  II., 
App.  Ill,  Hilger,  London,  1921. 

[15] Kubelka, P. and Munk, F., “Ein Beitrag zur Optik
der Farbanstriche”, Z. tech. Physik, 12, 1931, 593.

[16]  Maloney,  L.  and  Wandell,  B.,  “Color  Constancy: 
a  method  for  recovering  surface  spectral  reflectance,” 
Journal  of  the  Optical  Society  of  America ,  Vol.  3,  Jan¬ 
uary  1986,  29-33. 

[17]  Nalwa,  V.  and  Binford,  T.O.,  “On  Detecting 
Edges,”  IEEE  Transactions  on  Pattern  Analysis  and 
Machine  Intelligence  ,  Volume  8,  No.  6,  November  1986, 
699-714. 

[18]  Orchard,  S.,  “Reflection  and  Transmission  of  Light 
by  Diffusing  Suspensions,”  Journal  of  the  Optical  Soci¬ 
ety  of  America  Volume  59,  Number  12,  December  1969, 
1584-1597. 

[19] Reichman, J., “Determination of Absorption and
Scattering Coefficients for Nonhomogeneous Media. 1:
Theory,” Applied Optics, Vol. 12, No. 8, August 1973,
1811-1815.

[20] Rubin, J. and Richards, W., “Color Vision and Image
Intensities: When are Changes Material?”, MIT AI
Memo 631, May 1981.

[21]  Shafer,  S.,  “Using  Color  to  Separate  Reflection 
Components”,  University  of  Rochester  TR  136,  April 
1984. 

[22]  Sparrow,  E.  and  Cess,  R.,  Radiation  Heat  Transfer, 
McGraw-Hill,  New  York,  1978. 

[23] Torrance, K. and Sparrow, E., “Theory for Off-Specular
Reflection from Roughened Surfaces,” Journal
of the Optical Society of America, 57 (1967), 1105-1114.

[24] Witkin, A., “Intensity-based Edge Classification,”
Proceedings of AAAI, 1982, 36-41.

[25]  Yuille,  A.,  “A  Method  for  Computing  Spectral  Re¬ 
flectance”,  MIT  AI  Memo  752,  December  1984. 




Using  a  Color  Reflection  Model 
to  Separate  Highlights  from  Object  Color1 

Gudrun  J.  Klinker,  Steven  A.  Shafer,  and  Takeo  Kanade 


Department  of  Computer  Science,  Carnegie  Mellon,  Pittsburgh,  PA  15213,  USA 


Abstract 

Current  methods  for  image  segmentation  are  confused  by 
artifacts  such  as  highlights,  because  they  are  not  based  on  any 
physical  model  of  these  phenomena.  In  this  paper,  we  present  an 
approach  to  color  image  understanding  that  accounts  for  color 
variations  due  to  highlights  and  shading.  Based  on  the  physics  of 
reflection  by  dielectric  materials,  such  as  plastic,  we  show  that  the 
color  of  every  pixel  from  an  object  can  be  described  as  a  linear 
combination  of  the  object  color  and  the  highlight  color.  According 
to  this  model,  all  color  pixels  from  one  object  form  a  planar  cluster 
in  the  color  space  whose  shape  is  determined  by  the  object  and 
highlight  colors  and  by  the  object  shape  and  illumination  geometry. 
We  present  a  method  which  exploits  the  color  difference  between 
object  color  and  highlight  color,  as  exhibited  in  the  cluster  shape, 
to  separate  the  color  of  every  pixel  into  a  matte  component  and  a 
highlight  component.  This  generates  two  intrinsic  images,  one 
showing the scene without highlights, and the other one showing
only the highlights. The intrinsic images may be a useful tool for a
variety of algorithms in computer vision that cannot detect or
analyze highlights, such as stereo vision, motion analysis, shape
from shading, and shape from highlights. We have applied this
method  to  real  images,  in  a  laboratory  environment,  and  we  show 
these  results  and  discuss  some  of  the  pragmatic  issues  endemic  to 
precision  color  imaging. 

1. Introduction

When  we  look  at  an  image,  we  can  interpret  what  we  see  as  a 
collection  of  shiny  and  matte  surfaces,  smooth  and  rough, 
interacting  with  light,  shape,  and  shadow.  However,  computer 
vision  has  not  yet  been  successful  at  deriving  a  similar  description 
of  surface  and  illumination  properties  from  an  image.  The  key 
reason  for  this  failure  has  been  a  lack  of  models  or  descriptions  rich 
enough  to  relate  pixels  and  pixel-aggregate  properties  to  these 
scene  characteristics.  In  the  past,  most  work  with  color  images  has 
considered  object  color  to  be  a  constant  property  of  an  object. 
Color  variation  on  an  object  was  attributed  to  noise.  However,  in 
real  scenes,  color  variation  depends  to  a  much  larger  degree  on  the 
optical  reflection  properties  of  the  scene,  which  cause  the 
perception of object color, highlights, shadows and shading [3, 9].
As a consequence, color variation cannot be regarded to be of a
merely statistical nature. On the contrary, it exhibits characteristics
that can be determined and used for color vision.


¹This material is based upon work supported by the National Science Foundation
under Grant DCR-8419990 and by the Defense Advanced Research Projects Agency
(DOD), ARPA Order No. 4876, monitored by the Air Force Avionics Laboratory under
contract F33615-84-K-1520. Any opinions, findings, and conclusions or
recommendations expressed in this publication are those of the authors and do not
necessarily reflect the views of the National Science Foundation or the Defense
Advanced Research Projects Agency or the US Government.


This  paper  presents  an  approach  to  color  image  understanding 
that  accounts  for  color  variations  due  to  highlights  and  shading. 
We  use  a  reflection  model  which  describes  the  color  of  every  pixel 
from  an  object  as  a  linear  combination  of  the  object  color  and  the 
highlight  color  [10].  According  to  our  model,  all  color  pixels  from 
one  object  form  a  planar  cluster  in  the  color  space.  The  cluster 
shape  is  determined  by  the  object  and  highlight  colors  and  by  the 
object  shape  and  illumination  geometry. 

We  present  a  method  that  exploits  the  color  difference  between 
object  color  and  highlight  color,  as  exhibited  in  the  cluster  shape, 
to  separate  the  color  of  every  pixel  into  a  matte  component  and  a 
highlight  component.  Our  method  generates  two  intrinsic  images, 
one  showing  the  scene  without  highlights,  and  the  other  one 
showing  only  the  highlights.  These  intrinsic  images  can  be  a  useful 
tool  for  a  variety  of  algorithms  in  computer  vision  that  cannot  detect 
or  analyze  highlights,  such  as  stereo  vision,  motion  analysis,  shape 
from shading and shape from highlights [2, 3, 11]. We demonstrate
the  applicability  of  our  method  to  real  color  images. 

We  begin  by  introducing  the  dichromatic  reflection  model  which 
describes  the  interaction  of  light  with  opaque  dielectric  materials. 
Using  this  model,  we  describe  the  color  variation  on  an  object  as  a 
function  of  the  reflection  properties  of  the  materials,  object  shapes 
and  sensor  characteristics.  Next,  we  demonstrate  how  our 
reflection  model  can  be  used  for  the  analysis  of  color  images.  We 
present  methods  to  determine  the  illumination  color  and  to  detect 
and  remove  highlights  from  real  color  images.  Finally,  we  discuss 
the  implications  and  the  future  direction  of  our  work. 

2.  The  dichromatic  reflection 
model 

When  we  look  at  a  glossy  object,  we  usually  see  the  reflected  light 
as  composed  of  two  colors  that  typify  the  highlight  areas  and  the 
matte  object  parts.  The  dichromatic  reflection  model  describes  this 
phenomenon  for  scene  configurations  in  which  a  single  light 
source  of  an  arbitrary  color  illuminates  opaque  dielectric  materials 
[10]. As we will show later in this paper, this model accounts well
for  real  color  data  under  suitable  conditions. 

When  light  hits  an  object  of  opaque  dielectric  material,  the 
material  interface  immediately  reflects  some  percentage  of  the 
light,  according  to  Fresnel's  law  of  reflection.  In  general,  the 
reflected  light  has  approximately  the  same  color  as  the  light  source. 
The  remaining  percentage  of  the  incident  light  penetrates  into  the 
material  body,  which  then  scatters  the  light  and  absorbs  it  at  some 
wavelengths,  before  it  reemits  the  rest  [4, 6, 12].  The  color  of  this 
light  is  determined  by  the  illumination  color  and  the  reflection 
properties  of  the  material.  Common  names  for  the  reflection 
process  at  the  interface  are  the  terms  specular  reflection,  highlight 
or  gloss,  whereas  the  reflection  process  in  the  material  body  is 
generally  called  diffuse  reflection  or  matte  color.  We  refer  to  the 




two  processes  as  interface  and  body  reflection  because  those 
terms  make  a  more  precise,  physical  distinction  between  the 
reflection  processes  than  the  geometry-based  terms  specular  and 
diffuse  reflection.  We  will  use  this  terminology  and  the 
corresponding  terms  highlights  and  matte  object  color  throughout 
the  paper. 


Figure 2-1: Light reflection at dielectric materials


Based on the above discussion, the dichromatic reflection model
describes the spectral radiance $L(\lambda, i, e, g)$² of a point in the scene as
a sum of an interface reflection component $L_i(\lambda, i, e, g)$ and a body
reflection component $L_b(\lambda, i, e, g)$. The model assumes that the
spectral properties of illumination and reflection on an object are
independent of the orientation of the surface, which is a reasonable
approximation [10]. We thus decompose each of the two reflection
components into a spectral composition $c(\lambda)$ that describes a color,
and a magnitude, $m(i, e, g) \in [0, 1]$, that describes a geometric scale
factor:

$$L(\lambda, i, e, g) = m_i(i, e, g)\,c_i(\lambda) + m_b(i, e, g)\,c_b(\lambda) \qquad (2.1)$$

When  a  color  TV  camera  records  an  image  of  a  scene,  it  generally 
uses  three  color  primaries  to  represent  the  spectrum  of  the  light 
that  is  reflected  from  the  objects  towards  the  camera.  In  the  image 
formation  process,  the  camera  transforms  the  light  of  the  incoming 
ray at pixel position $(x, y)$ via tristimulus integration from an infinite
light spectrum into a triple of color values, $C(x, y) = [r, g, b]$. This
process sums the amount of light at each wavelength weighted by
the transmittance of the color filters and the responsivity of the
camera at each wavelength. Because this is a linear
transformation, and because the photometric angles $i$, $e$, and $g$
depend on $x$ and $y$, the dichromatic reflection model can be applied
to color pixel values. This allows us to describe the color pixel
value $C(x, y)$ as a linear combination of the vectors representing the
colors of interface reflection $C_i$ and body reflection $C_b$ at the
corresponding point in the scene:

$$C(x, y) = m_i(i, e, g)\,C_i + m_b(i, e, g)\,C_b \qquad (2.2)$$

In this equation, the color vectors $C_i$ and $C_b$ are constant for a
surface, and the scale factors $m_i$ and $m_b$ vary at each pixel.
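Equation (2.2) can be read as a forward model that produces a pixel color
from two fixed color vectors and two per-pixel scale factors. The sketch
below is only an illustration under stated assumptions: the function name
`dichromatic_pixel` and the example color vectors (a white highlight color
and a reddish body color) are hypothetical values chosen for the example.

```python
def dichromatic_pixel(m_i, m_b, c_i, c_b):
    """Pixel color as m_i * C_i + m_b * C_b, following (2.2).

    m_i, m_b: geometric scale factors, each assumed to lie in [0, 1].
    c_i, c_b: interface and body color vectors as (r, g, b) triples.
    """
    for m in (m_i, m_b):
        if not 0.0 <= m <= 1.0:
            raise ValueError("scale factors must lie in [0, 1]")
    return tuple(m_i * ci + m_b * cb for ci, cb in zip(c_i, c_b))

# Hypothetical color vectors: white illumination and a reddish object.
C_I = (1.0, 1.0, 1.0)
C_B = (0.8, 0.1, 0.1)

matte = dichromatic_pixel(0.0, 0.6, C_I, C_B)      # body reflection only
highlight = dichromatic_pixel(0.5, 0.2, C_I, C_B)  # both components present
```

Only the two scale factors change from pixel to pixel; the color vectors stay
fixed for the whole surface.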


²$i$, $e$, and $g$ describe the angles of light incidence and exitance and the phase
angle; $\lambda$ is the wavelength parameter.


3.  Color  variation  and  object 
shape 

In the previous section we have shown how the pixel color $C(x, y)$
depends  on  the  optical  properties  of  the  scene.  We  will  now 
discuss  the  relationship  between  the  colors  of  all  pixels  on  an 
object.  As  a  means  to  model  the  color  variation  over  an  entire 
object,  we  use  a  color  histogram  in  the  color  space,  which  is  the 
projection  of  the  colors  of  all  pixels  from  the  object  into  the  color 
space. 

The  dichromatic  reflection  model  assumes  that  there  is  a  single 
light source in the scene, without ambient light or inter-reflection
between objects. Under this assumption, the colors of all pixels
from an object are linear combinations of the same interface and
body reflection colors $C_i$ and $C_b$. Color variation within an object
area thus depends only on the geometric scale factors $m_i$ and $m_b$
while $C_i$ and $C_b$ are constant. Accordingly, $C_i$ and $C_b$ span a
dichromatic plane in the color space, and the colors of all pixels
from one object lie in this plane.
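In a three-dimensional color space, the dichromatic plane passes through the
origin and is spanned by the two color vectors, so its normal is their cross
product. The following minimal sketch tests whether a pixel lies in that
plane; the helper names and the example vectors are assumptions made for
illustration, and real data would need a noise tolerance.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def plane_normal(c_i, c_b):
    """Unit normal of the plane through the origin spanned by c_i and c_b."""
    n = cross(c_i, c_b)
    length = math.sqrt(dot(n, n))
    if length == 0:
        raise ValueError("color vectors are parallel; the plane is undefined")
    return tuple(x / length for x in n)

def distance_from_plane(pixel, normal):
    """Distance of a pixel color from the dichromatic plane."""
    return abs(dot(pixel, normal))

# Hypothetical color vectors and two test pixels.
C_I = (1.0, 1.0, 1.0)
C_B = (0.8, 0.1, 0.1)
normal = plane_normal(C_I, C_B)

on_object = (0.78, 0.36, 0.36)   # equals 0.3 * C_I + 0.6 * C_B
elsewhere = (0.5, 0.9, 0.1)      # not a combination of C_I and C_B
```

A pixel built from the two color vectors has distance zero, while a color
that needs a third component does not.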

Within  the  dichromatic  plane,  the  color  pixels  form  a  dense 
cluster.  There  exists  a  close  relationship  between  the  shape  of 
such  a  color  cluster  and  the  geometric  properties  of  body  and 
interface  reflection  and  the  shape  of  the  object.  We  can  use  this 
relationship  to  determine  characteristic  features  of  the  color 
clusters.  As  an  aid  to  intuition,  we  will  assume  that  body  reflection 
is approximately Lambertian and that interface reflection is
describable  by  a  function  with  a  sharp  peak  around  the  angle  of 
perfect  mirror  reflection.  This  is  a  simplified  view  of  the  reflection 
processes  in  real  scenes.  However,  as  we  will  demonstrate,  it  is 
sufficient  for  our  analysis  of  color  images. 


Figure 3-1: The shape of the color cluster for a cylindrical object


Figure  3-1  shows  a  sketch  of  a  shiny  cylinder.  The  left  part  of  the 
figure  displays  the  magnitudes  of  the  body  and  interface 
components as curves showing the loci of constant body or
interface reflection. The right part of the figure shows the
corresponding  color  cluster  in  the  dichromatic  plane.  This 
represents  the  configuration  observed  in  a  histogram  of  pixel 
values  in  the  color  space.  To  relate  the  terminology  of  the 
dichromatic  reflection  model  to  the  shape  of  the  color  clusters,  we 
classify  the  color  pixels  as  matte  pixels,  highlight  pixels  or  clipped 
color  pixels.  The  following  paragraphs  discuss  the  characteristic 
features  of  each  of  these  classes. 




Matte  pixels  are  projections  of  points  in  the  scene  that  exhibit 
only  body  reflection  in  the  direction  of  the  viewer.  The  color  of 
such  pixels  is  thus  determined  by  the  color  of  body  reflection, 
scaled  according  to  the  geometrical  relationship  between  the  local 
surface  normal  of  the  object  and  the  viewing  and  illumination 
directions.  Consequently,  the  colors  of  the  matte  pixels  form  a 
matte  line  in  the  color  space,  in  the  direction  of  the  body  reflection 
vector C_b.

Highlight  pixels  are  projections  of  scene  points  that  exhibit  both 
body  reflection  and  interface  reflection  in  the  viewing  direction. 
The  colors  of  all  pixels  in  a  highlight  area  that  lie  on  a  line  of 
constant  body  reflection  vary  only  in  their  respective  amounts  of 
interface  reflection.  The  colors  of  these  pixels  thus  form  a  straight 
highlight  line  in  the  color  space  that  starts  at  the  matte  cluster  at  the 
position  that  is  determined  by  their  body  reflection  component  mbH. 
The  direction  of  the  highlight  line  is  determined  by  the  interface 
reflection vector C_i. Combined with the neighboring highlight
pixels of slightly different amounts of body reflection, all highlight
pixels  form  a  highlight  cluster  in  the  color  space  that  has  the  shape 
of  a  skewed  wedge.  If  more  than  one  highlight  exists  on  an  object, 
each  of  them  describes  a  highlight  line  in  the  color  space  (see 
Figure  3-2). 



Figure  3-2:  Color  cluster  shapes  under  changing  illumination 
geometry 

The  combined  color  cluster  of  matte  and  highlight  pixels  thus 
looks  like  a  skewed  T  or  comb.  The  exact  shape  of  the  cluster 
depends  on  the  illumination  geometry.  If  the  phase  angle  g 
between  the  illumination  and  viewing  direction  at  a  highlight  is  very 
small,  the  incidence  direction  of  the  light  is  very  close  to  the  surface 
normal.  According  to  Lambert’s  law,  the  body  reflection 
component  is  maximal  at  such  object  points  and  thus,  the  starting 
point  mbH  for  the  highlight  line  is  at  the  tip  of  the  matte  line.  The 
wider  g  becomes  at  a  highlight,  the  smaller  is  the  amount  of 
underlying  body  reflection  and,  thus,  the  greater  is  the  distance 
between  the  tip  of  the  matte  line  and  the  starting  point  mbH  of  the 
highlight  line  (see  Figure  3-2).  At  the  same  time,  small  positional 
variations  on  the  object  (as  for  neighboring  pixels)  become  less 
influential  on  the  incidence  and  exitance  angles  i  and  e.  The 
highlight  thus  becomes  dimmer  and  spreads  out  over  a  larger  area, 
covering  a  larger  range  of  values  of  underlying  body  reflection.  As 
a  result,  the  highlight  line  grows  wider,  exhibiting  more  strongly  the 
shape  of  a  wedge. 

Clipped  color  pixels  are  highlight  pixels  at  which  the  light 
reflection  exceeds  the  dynamic  range  of  the  camera.  Depending  on 
the  color  of  the  object,  the  dynamic  range  may  be  exceeded  at 
some  points  in  one  color  band  but  not  in  the  other  two,  and  the 
highlight  cluster  then  bends  near  the  wall  of  the  color  cube  that 
describes  the  limit  of  sensitivity  of  that  color  band.  A  second  color 


band  may  saturate  at  some  brighter  object  points,  causing  the  color 
cluster  to  bend  again  at  the  edge  of  the  limiting  walls  of  both  color 
bands. At the innermost points of the highlight, the dynamic range
of the camera may even be exceeded in all three color bands, so that
these pixels look white even though the color of the illumination
may not be white.
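
The bending of the highlight cluster at the walls of the color cube can be illustrated with a small simulation. The sketch below is only an illustration: the vectors `C_B` and `C_I`, the camera limit `max_value`, and the function name `highlight_line` are assumed values chosen for this example, not measurements from the experiments reported here.

```python
import numpy as np

# Hypothetical reflection vectors in RGB (illustrative values only).
C_B = np.array([0.9, 0.4, 0.1])   # body reflection color of an orange-like object
C_I = np.array([1.0, 1.0, 1.0])   # interface reflection color (white light)

def highlight_line(m_b, m_i_values, max_value=1.0):
    """Colors along one highlight line, clipped to the camera's dynamic range."""
    ideal = m_b * C_B + np.outer(m_i_values, C_I)   # unclipped dichromatic colors
    return np.clip(ideal, 0.0, max_value)           # each band saturates independently

m_i = np.linspace(0.0, 1.0, 11)                     # increasing interface reflection
observed = highlight_line(0.8, m_i)
saturated_bands = (observed >= 1.0).sum(axis=1)     # number of clipped bands per pixel
print(saturated_bands)
```

With these assumed values the red band saturates first, then the green band, and finally all three bands, at which point the pixel is white regardless of the illumination color.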


Figure 3-3: Shape of color clusters in the color space


Summarizing  this  discussion,  the  general  shape  of  a  color  cluster 
is  displayed  in  Figure  3-3.  Although  the  color  cluster  of  a  specific 
object  may  not  fill  the  entire  parallelogram,  it  reliably  exhibits  some 
features  that  can  thus  be  used  and  searched  for  by  algorithms. 
Figure  3-3  displays  these  features  as  bold  lines.  Color  clusters 
generally  provide  a  matte  line  on  which  all  matte  pixels  of  an  object 
area  lie.  Depending  on  the  object  shape  and  the  illumination 
geometry,  color  clusters  have  some  number  of  highlight  lines.  We 
use  the  brightest  highlight  line  as  a  representative  of  all  these  lines. 
Further  lines  connected  to  the  highlight  lines  are  clipped  color 
lines.  An  algorithm  can  analyze  a  color  cluster  by  searching  for 
these  lines  in  the  color  space.  They  determine  the  general  shape  of 
the  parallelogram  and  the  orientation  of  the  dichromatic  plane. 
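
As an illustration of the cluster geometry summarized above, the following minimal sketch synthesizes matte and highlight pixels from two assumed reflection vectors and checks that they span a plane through the origin of the color space. The vectors `C_B` and `C_I`, the function names, and the use of a singular value decomposition to fit the plane are assumptions made for this example; the text does not specify how the plane is fitted.

```python
import numpy as np

# Hypothetical reflection vectors in RGB (illustrative values only).
C_B = np.array([0.9, 0.4, 0.1])   # body reflection color
C_I = np.array([1.0, 1.0, 1.0])   # interface reflection color

def synthesize_cluster(n_matte=200, n_highlight=50, m_bh=0.8, seed=0):
    """Return an (N, 3) array of colors of the form m_b * C_B + m_i * C_I."""
    rng = np.random.default_rng(seed)
    # Matte pixels: body reflection only, scaled by shading.
    matte = np.outer(rng.uniform(0.0, 1.0, n_matte), C_B)
    # Highlight pixels: nearly constant body reflection plus interface reflection.
    m_b = m_bh + rng.uniform(-0.05, 0.05, n_highlight)
    m_i = rng.uniform(0.0, 0.5, n_highlight)
    highlight = np.outer(m_b, C_B) + np.outer(m_i, C_I)
    return np.vstack([matte, highlight])

def fit_plane_through_origin(colors):
    """Return the unit normal of the best-fit plane and the singular values."""
    _, s, vt = np.linalg.svd(colors, full_matrices=False)
    return vt[2], s               # direction of least variance is the normal

cluster = synthesize_cluster()
normal, s = fit_plane_through_origin(cluster)
print(normal, s)
```

For this noise-free data the smallest singular value is essentially zero. With real pixel data its size relative to the other two would indicate how well the cluster fits a plane.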


4.  Analysis  of  real  color  images 

There  exist  some  inherent  assumptions  in  the  dichromatic 
reflection  model  [10].  However,  as  we  will  illustrate,  the  model 
accounts  well  for  real  color  data  under  suitable  conditions.  We  also 
present  methods  for  color  image  analysis  that  use  the  shape  of  the 
color  clusters. 

4.1  Color  clusters  from  real  color  images 

We  have  taken  a  series  of  color  images  in  the  Calibrated  Imaging 
Laboratory  at  Carnegie-Mellon.  The  scene  consists  of  an  orange,  a 
green  and  a  yellow  plastic  cup  under  white  or  yellow  illumination. 
Black  curtains  on  the  walls  were  used  to  eliminate  ambient  light  in 
the  visible  spectrum  of  the  scene.  We  use  a  spectral  linearization 
method  to  compensate  for  the  non-linear  response  of  our  camera 
to  image  brightness.  The  upper  left  box  of  Figure  4-1  shows  the 
linearized  image  of  the  orange,  the  yellow  and  the  green  cup  under 
yellow  illumination. 




Figure 4-1: Color clusters and dichromatic planes of cups under
yellow light


We  use  a  graphical  display  program  which  takes  a  color  picture 
as  input  and  displays  the  colors  of  all  pixels  of  selected  object  areas 
as color clusters in the color space. The upper right box of Figure
4-1 displays the color histogram of the pixels from the marked areas
of  the  color  image  (in  the  upper  left  box)  in  the  color  space.  The 
color  space  is  shown  as  a  color  cube,  with  each  dimension  of  the 
cube  representing  the  intensity  scale  of  one  of  the  three  color 
primaries.  Its  origin  is  at  the  black  corner  of  the  cube. 
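
The color histogram described above can be sketched as a three-dimensional count over the color cube. This is a minimal illustration, not the display program itself: the function name `color_histogram`, the bin count, and the 8-bit value range are assumptions.

```python
import numpy as np

def color_histogram(image, mask, bins=32, max_value=255):
    """Count the colors of the masked pixels in a bins x bins x bins color cube.

    image: (H, W, 3) array of RGB values; mask: (H, W) boolean array that
    selects the object area. Index (0, 0, 0) is the black corner of the cube.
    """
    pixels = image[mask].astype(float)                 # (N, 3) selected colors
    hist, _ = np.histogramdd(
        pixels, bins=(bins, bins, bins), range=[(0, max_value + 1)] * 3
    )
    return hist

# Tiny example: a black image with one orange pixel.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0, 0] = (255, 128, 0)
mask = np.ones((4, 4), dtype=bool)
hist = color_histogram(image, mask)
print(hist.shape, hist.sum())
```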

The  color  clusters  of  the  cups  each  lie  approximately  in 
dichromatic  planes.  Within  the  planes,  they  form  matte  and 
highlight  lines,  thus  demonstrating  that  real  color  data  follows  the 
theory  of  the  dichromatic  reflection  model.  The  color  clusters  also 
have  clipping  lines,  due  to  the  bright  intensity  of  the  yellow 
illumination,  as  reflected  from  the  middle  of  the  highlights.  Note 
that, accordingly, the colors in the middle of the highlight areas
look  white,  whereas  the  pixels  closer  to  the  highlight  boundaries 
are  yellow. 

4.2  Determining  the  color  of  illumination 

If  several  glossy  objects  of  different  color  are  illuminated  by  the 
same  light  source,  each  object  produces  a  dichromatic  plane. 
Because  all  of  these  dichromatic  planes  contain  the  same  interface 
reflection vector C_i, they intersect along a single line which is the
color of the illumination (see the lower left box of Figure 4-1). This
fact can be used by color constancy algorithms [1, 7] that try to
remove  the  influence  of  the  illumination  color  from  the  body 
reflection  component,  thus  "normalizing"  the  image  to  a  standard 
white  illumination. 
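
The intersection of two dichromatic planes can be computed from their normals. The sketch below assumes that each plane passes through the origin and is represented by a unit normal; the color values and function names are hypothetical and serve only to illustrate the geometry.

```python
import numpy as np

def plane_normal(c_body, c_interface):
    """Unit normal of the dichromatic plane spanned by the two reflection vectors."""
    n = np.cross(c_body, c_interface)
    return n / np.linalg.norm(n)

def illumination_direction(normal_a, normal_b):
    """Unit vector along the line in which two planes through the origin intersect."""
    d = np.cross(normal_a, normal_b)
    length = np.linalg.norm(d)
    if length < 1e-9:
        raise ValueError("planes are nearly parallel; the intersection is ill-defined")
    d = d / length
    return d if d.sum() >= 0 else -d    # orient the line into the color cube

# Hypothetical colors: yellowish light, one orange and one green object.
light = np.array([1.0, 0.9, 0.3])
orange = np.array([0.9, 0.4, 0.1])
green = np.array([0.1, 0.7, 0.2])

estimate = illumination_direction(plane_normal(orange, light),
                                  plane_normal(green, light))
print(estimate)
```

With more than two objects, or with noisy plane fits, a least-squares estimate over all plane normals would be the natural extension of this sketch.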

4.3  Detecting  and  removing  highlights 

We  have  developed  and  implemented  an  algorithm  that  uses  the 
shape  of  the  color  clusters  to  detect  and  remove  highlights  from 
color  images.  The  program  projects  the  pixels  of  selected  image 
areas  into  the  color  space  and  fits  a  dichromatic  plane  to  the  color 
data  from  each  image  area.  The  program  then  searches  within 
each  dichromatic  plane  for  the  matte  line,  the  brightest  highlight 
line  and  lines  of  clipped  colors.  These  lines  are  extracted  from  the 


dichromatic  plane  by  using  a  recursive  line  splitting  algorithm  [8], 
and  classified  as  the  matte  vector,  the  highlight  vector  and  clipped 
color  vectors.  The  program  assumes  that  the  line  starting  closest  to 
the  black  corner  is  the  matte  vector.  The  next  one  connected  to  it 
is classified as the highlight vector. The remaining lines are
assumed  to  describe  clipped  color  data. 

Each  color  pixel  of  the  image  is  then  broken  up  into  its  reflection 
components.  In  order  to  remove  interface  reflection,  the  program 
projects  the  color  of  every  pixel  onto  the  respective  dichromatic 
plane.  If  the  projected  color  is  close  to  a  clipped  color  vector,  it  is 
replaced  by  the  color  at  the  end  of  the  highlight  vector.  The 
algorithm  then  projects  the  color  of  every  pixel  along  the  highlight 
vector  onto  the  matte  line.  The  result  is  the  intrinsic  matte  image  of 
the  scene.  Conversely,  the  program  forms  the  intrinsic  highlight 
image  of  a  scene  by  projecting  the  color  of  every  pixel  along  the 
matte  vector  onto  a  line  that  is  parallel  to  the  highlight  vector  but 
goes  through  the  origin  of  the  color  space. 
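
The two projections described above can be sketched as a least-squares decomposition of each pixel color into its body and interface components. This is a simplified illustration under stated assumptions: the matte and highlight vectors are taken as given, the handling of clipped colors is omitted, and the function name `split_reflection` is hypothetical.

```python
import numpy as np

def split_reflection(pixels, c_body, c_interface):
    """Split colors into matte and highlight parts.

    pixels: (N, 3) array of colors. Each color is approximated as
    m_b * c_body + m_i * c_interface. The least-squares solution first
    projects the color onto the dichromatic plane and then resolves it
    along the two vectors.
    """
    basis = np.column_stack([c_body, c_interface])             # (3, 2)
    coeffs, *_ = np.linalg.lstsq(basis, pixels.T, rcond=None)  # (2, N)
    m_b, m_i = coeffs
    matte = np.outer(m_b, c_body)             # intrinsic matte colors
    highlight = np.outer(m_i, c_interface)    # intrinsic highlight colors
    return matte, highlight

# Hypothetical vectors and two test pixels: one matte, one inside a highlight.
c_body = np.array([0.9, 0.4, 0.1])
c_interface = np.array([1.0, 1.0, 1.0])
pixels = np.array([0.5 * c_body, 0.8 * c_body + 0.3 * c_interface])
matte, highlight = split_reflection(pixels, c_body, c_interface)
print(matte)
print(highlight)
```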

The  program  has  been  applied  to  the  pictures  of  an  orange,  a 
yellow  or  a  green  plastic  cup  under  white  or  yellow  light.  Figure  4-2 
shows  the  results  we  obtained  from  running  the  algorithm  on  an 
image  of  the  orange  cup  under  white  light.  The  upper  boxes  show 
the  image  of  the  cup  and  the  color  cluster  that  is  generated  by  the 
pixels  from  the  marked  object  area.  The  lower  two  boxes  display 
the  resulting  intrinsic  images  of  our  algorithm.  The  left  box  shows 
only  the  highlight,  and  the  right  box  shows  the  object  without  the 
highlight.  The  results  of  applying  the  algorithm  to  the  other 
pictures  are  similar. 



Figure  4-2:  Intrinsic  images  of  the  orange  cup  under  white 
illumination 


A  qualitative  inspection  of  the  intrinsic  images  reveals  that  our 
algorithm  is  able  to  separate  the  highlights  from  the  body  reflection 
component.  Note  that  the  highlight  image  displays  both  the 
highlight  from  the  middle  of  the  cup  and  the  small  amount  of  gloss 
that is reflected from the handle of the cup. However, the current
algorithm  does  not  yet  detect  the  interface  reflection  color  reliably. 
The  scene  of  Figure  4-2  was  illuminated  by  white  light.  In  the  color 
cube,  the  white  light  corresponds  to  the  diagonal  direction,  from 
the  black  corner  to  the  white  corner.  Since  the  highlight  line  of  this 




image starts with an already high amount of body
reflection, the diagonal direction of white light is hard to detect and,
in this image, the clipping vector could not be distinguished from
the highlight vector. For this reason, the computed highlight vector
misses the red component and the highlight image looks cyan. The
algorithm can be improved by using the intersection of several
dichromatic planes as an indication of the highlight vector (see
Figure 4-1).

The potential utility of the intrinsic images stems from two facts:
both the matte and the highlight image have simpler geometric
properties than does intensity in a monochrome image, which
represents a weighted sum of the two; and the matte image is
relatively insensitive to changes in viewpoint from one image to the
next.


5.  Conclusion 

In  this  paper,  we  have  demonstrated  that  it  is  possible  to  analyze 
real  color  images  by  using  a  color  reflection  model.  Our  model 
accounts  for  highlight  reflection  and  matte  shading,  as  well  as  for 
the  limited  dynamic  range  of  cameras.  By  developing  a  physical 
description  of  color  variation  in  color  images,  we  have  developed  a 
method to separate highlight reflection from matte object reflection.
The  resulting  two  intrinsic  images  are  promising  for  improving  the 
results  of  many  other  computer  vision  algorithms  that  cannot  detect 
or  analyze  highlights.  To  demonstrate  this,  we  plan  to  combine  our 
approach  with  a  photometric  stereo  system  to  guide  a  robot  arm 
[5].

The  key  point  leading  to  the  success  of  this  work  is  our  modeling 
of  highlights  as  a  linear  combination  of  both  body  and  interface 
reflection.  In  contrast,  previous  work  on  highlight  detection  in 
images  has  generally  assumed  that  the  color  of  the  pixels  within  a 
highlight  is  completely  unrelated  to  the  object  color.  This 
assumption  would  result  in  two  unconnected  clusters  in  the  color 
space:  one  line  or  ellipsoid  representing  the  object  color  and  one 
point  or  sphere  representing  the  highlight  color.  Our  model  and 
our  color  histograms  demonstrate  that,  in  real  scenes,  a  transition 
area  exists  on  the  objects  from  purely  matte  areas  to  the  spot  that  is 
generally  considered  to  be  the  highlight.  This  transition  area 
determines  the  characteristic  shapes  of  the  color  clusters  which  is 
the  information  that  we  use  to  detect  and  remove  highlights.  This 
view  of  highlights  should  open  the  way  for  quantitative  shape-from- 
gloss  analysis,  as  opposed  to  the  current  binary  methods  based  on 
thresholding  intensity. 

Our  approach  to  color  image  analysis  may  influence  research  in 
other  areas  of  color  computer  vision.  The  color  histograms 
demonstrate  that  all  color  pixels  from  one  object  (material)  form  a 
single  cluster  in  the  color  space.  A  color  cluster  thus  groups  pixels 
into  areas  of  constant  material  properties,  independently  of 
illumination  geometry  influences,  such  as  shading  or  highlights. 
This  finding  can  be  used  by  color  segmentation  algorithms  to 
distinguish  material  changes  from  shading  or  highlight  boundaries. 
Furthermore,  the  shape  of  the  color  clusters  reveals  some 
information  about  the  illumination  geometry  and  the  object  shapes. 
We  are  currently  investigating  these  implications  of  the  model  to 
improve  color  image  understanding  methods,  and  we  are  also 
considering  extensions  of  the  model  to  account  for  other  material 
types  and  more  complex  illumination  conditions  including  ambient 
light, light sources in several colors, shadow casting and inter-reflection
between objects.


Although the current method has only been applied in a
laboratory setting, its initial success shows the value of modeling
the physical nature of the visual environment. Our work and the
work of others in this area may lead to methods that will free
computer  vision  from  its  current  dependence  on  signal-based 
methods  for  image  segmentation. 

Acknowledgments 

We would like to thank Ruth Johnston-Feller for her comments
and suggestions and Georg Klinker and Richard Szeliski for
commenting on previous drafts of this paper.

References 

[1]  M. D'Zmura and P. Lennie.
Mechanisms of color constancy.
Journal of the Optical Society of America A (JOSA-A) 3(10):1662-1672, October, 1986.

[2]  L. Dreschler and H.-H. Nagel.
Volumetric Model and 3D Trajectory of a Moving Car Derived from Monocular TV Frame Sequences of a Street Scene.
Computer Graphics and Image Processing 20:199-228, 1982.

[3]  B.K.P. Horn.
Understanding Image Intensities.
Artificial Intelligence 8(11):201-231, 1977.

[4]  R.S. Hunter.
The Measurement of Appearance.
John Wiley and Sons, New York, 1975.

[5]  K. Ikeuchi.
Generating an Interpretation Tree From a CAD Model to Represent Object Configurations For Bin-Picking Tasks.
Technical Report CMU-CS-86-144, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213, August, 1986.

[6]  D.B. Judd and G. Wyszecki.
Color in Business, Science and Industry.
John Wiley and Sons, New York, 1975.

[7]  H.-C. Lee.
Method for computing the scene-illuminant chromaticity from specular highlights.
Journal of the Optical Society of America A (JOSA-A) 3(10):1694-1699, October, 1986.

[8]  T. Pavlidis.
Structural Pattern Recognition.
Springer Verlag, Berlin, Heidelberg, New York, 1977.

[9]  J.M. Rubin and W.A. Richards.
Color Vision and Image Intensities: When are Changes Material?
Biological Cybernetics 45:215-226, 1982.

[10]  S.A. Shafer.
Using Color to Separate Reflection Components.
COLOR research and application 10(4):210-218, Winter 1985.
Also available as technical report TR 136, University of Rochester, NY, April 1984.

[11]  C.E. Thorpe.
FIDO: Vision and Navigation for a Robot Rover.
PhD thesis, Computer Science Department, Carnegie-Mellon University, December, 1984.
Available as technical report CMU-CS-84-168.

[12]  S.J. Williamson and H.Z. Cummins.
Light and Color in Nature and Art.
John Wiley and Sons, New York, 1983.




RECOGNIZING  UNEXPECTED  OBJECTS: 
A  PROPOSED  APPROACH 

Azriel  Rosenfeld 


Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742 


ABSTRACT 

Humans  can  recognize  familiar,  but  unexpected 
objects,  belonging  to  highly  variable  object  classes,  in  a 
fraction  of  a  second — a  few  hundred  cycles  of  the  neural 
visual “hardware”. This position paper suggests computational
techniques that could serve as a basis for this
type  of  object  recognition  if  implemented  on  appropriate 
parallel  hardware. 

1.  INTRODUCTION 

Imagine  that  you  are  viewing  a  random  sequence  of 
slides,  each  showing  a  single  familiar  object  on  a  blank 
background.  The  objects  might  include  specific  kinds  of 
animals,  plants,  furniture,  household  implements,  tools, 
office equipment, alphanumeric characters; there are hundreds
of possible object classes, and the objects in each
class  can  vary  widely.  Nevertheless,  if  an  object  is 
sufficiently  unambiguous,  you  have  no  trouble  “instantly” 
identifying  it,  even  if  the  slide  showing  it  is  displayed  for 
a  very  short  time.  Within  a  fraction  of  a  second  after  the 
light  from  a  slide  reaches  your  eyes,  you  are  ready  to 
name  the  object  appearing  in  the  slide. 

The  neurons  in  your  brain’s  visual  pathway  cannot 
perform  long  sequences  of  computations  in  a  fraction  of  a 
second;  during  that  time  any  one  neuron  can  fire  at  most 
on  the  order  of  100  times.  Evidently,  your  visual  system 
must  be  able  to  perform  complex  tasks  very  rapidly 
through  the  use  of  massive  parallelism.  In  particular,  it 
is  able  to  recognize  familiar,  but  unexpected  objects, 
belonging to highly variable object classes, in at most a few
hundred  cycles  of  its  “wetware”. 

Traditional  computational  approaches  to  object 
recognition  fall  many  orders  of  magnitude  short  of  this 
performance. This is not simply because they are implemented
on single-processor computers; many of the traditional
techniques are inherently sequential, and it is not
obvious  how  they  could  be  speeded  up  to  the  required 
degree  even  through  massively  parallel  multiprocessing. 
The  purpose  of  this  position  paper  is  to  suggest  some 
computational  techniques  that  could  serve  as  a  basis  for 
rapid  recognition  of  objects,  if  implemented  on  suitable 
parallel  hardware. 

This research was supported by the National Science Foundation under Grant
DCR-86-03723.

Note: This paper is frankly speculative; many of the ideas in it are still in their
formative stages. It is being published in this preliminary form in order to invite
comments, criticisms, and counterexamples.


In  the  real  world,  object  recognition  is  greatly  aided 
(or  occasionally  misled)  by  expectations  and  by  context. 
Thus a computational model for real-world object recognition
should be strongly goal-directed. But as our random
slide show experiment demonstrates, rapid, accurate
recognition is also possible in the absence of prior expectations
or contextual clues. We deal in this paper only
with  this  type  of  recognition. 

Section  2  of  this  paper  reviews  traditional  paradigms 
for  characterizing  and  recognizing  complex  classes  of 
objects,  and  points  out  some  of  their  serious  limitations. 
Section  3  presents  some  conjectures  about  how  humans 
may  characterize  such  object  classes,  and  Section  4 
discusses  how  objects  might  be  rapidly  recognized  using 
appropriate  parallel  hardware. 

2.  CLASSICAL  PARADIGMS  AND  THEIR 
LIMITATIONS 

In  this  section  we  briefly  review  classical  paradigms 
for  characterizing  (or  “modeling”)  classes  of  objects  and 
for  recognizing  an  object  as  belonging  to  a  given  class, 
and  indicate  why  these  paradigms  are  inadequate  for  our 
purposes. 

2.1.  Characterizing  objects 

The  traditional  approach  to  characterizing  a  class  of 
objects  is  to  regard  the  objects  as  composed  of  parts  that 
have  given  properties  and  are  related  in  given  ways.  The 
parts  can  in  turn  be  regarded  as  composed  of  subparts, 
etc.,  but  in  practice  the  hierarchy  of  parts,  subparts, ...  is 
not  very  deep.  In  this  paper  we  will  generally  ignore  the 
hierarchical  nature  of  object  parts.  We  will  also  consider 
primarily  geometric  properties  and  relations  of  the  parts, 
but  our  ideas  extend  straightforwardly  to  incorporate 
properties and relations involving shading, color, or texture.

The  parts/properties/relations  approach  is  quite 
satisfactory for describing manufactured objects composed
of well-defined, precisely described parts, but its
limitations  become  apparent  if  we  try  to  apply  it  to  even 
the  simplest  classes  of  real-world  objects.  Here,  even  if 
the  parts  are  well-defined,  it  is  by  no  means  easy  to 
define  the  constraints  on  property  and  relation  values 




that  characterize  a  given  object  class.  To  give  a  very 
simple,  two-dimensional  example,  suppose  the  objects  are 
alphanumeric characters composed of straight line segments.
(Note that this ignores not only shading, color,
etc., but also stroke thickness, curvature, gaps, wiggles,
serifs, flourishes, etc.) Given a collection of three line segments,
what properties and relations must it have in
order  to  be  recognized  as  a  block  capital  “A”?  The  field 
of  automatic  character  recognition  is  over  30  years  old, 
but  no  one  has  yet  succeeded  in  completely  formulating 
such  a  definition. 

The  difficulty  of  fully  characterizing  classes  of  real- 
world  objects  arises  from  two  sources: 

a)  Such  classes  generally  do  not  have  crisp  definitions; 
they must be defined “fuzzily”. In our “A” example,
some configurations of three line segments would
be recognized by all observers as perfect A’s, others
would be regarded as imperfect or distorted A’s, still
others would be rejected by some observers, and so
on. (For some additional remarks on the recognizability
of alphanumeric characters see Section 5.)
Thus  it  is  not  always  obvious  whether  a  given 
configuration  is  or  is  not  an  A;  we  can  at  best  assign 
it  a  degree  of  membership  in  the  class  of  A’s. 

b)  The  space  of  configurations  of  object  parts  has  many 
degrees  of  freedom,  even  if  the  parts  themselves  are 
simple  (here:  straight  line  segments).  Thus  to 
characterize the class of A’s, we must define its
“membership function” over a high-dimensional
space. This function does not have a simple form;
the property and relation values that give rise to
acceptable A’s are highly interdependent.

Thus  characterizing  object  classes  can  be  impractical 
even when the objects are two-dimensional and are composed
of well-defined parts (such as straight line segments)
for which only a few well-defined properties and
relations  are  relevant.  It  becomes  even  less  practical 
when  the  parts,  properties,  and  relations  are  not 
mathematically  well  defined — e.g.,  what  are  the  parts 
that  comprise  a  script  capital  A,  and  how  are  the  shapes 
of  these  parts  characterized? 

2.2.  Recognizing  objects 

Suppose we have somehow succeeded in characterizing
a class of objects in terms of parts, properties, and
relations. Recognizing that an object belonging to the
given class appears in an image may still be very difficult.
The standard approach to object recognition involves
finding appropriate parts in the image (i.e., segmentation);
computing properties of and relations among these
parts; and looking for a configuration of parts whose property
and relation values satisfy the constraints that
characterize  the  given  class.  It  is  well  known  that  each  of 
these  steps  involves  major  difficulties,  some  of  which  will 
now  be  discussed. 

a)  Finding  the  parts  in  the  image 

In  an  image  of  a  three-dimensional  object,  the  parts 
of  the  object  may  not  all  be  visible,  and  even  if  a  part  is 


visible, the image shows only a two-dimensional projection
of it, seen from a viewpoint that is not known a
priori. Thus relating image parts (regions, region boundaries
(“edges”), etc.) to object parts is not straightforward.

Even  if  the  object  is  two-dimensional  (e.g.,  an 
alphanumeric  character),  so  that  its  parts  are  all  visible 
in  the  image,  it  may  be  difficult  to  find  them  correctly. 
Finding  regions  or  region  boundaries  in  a  noisy  image  is 
difficult,  and  so  is  segmentation  of  a  region  or  boundary 
into  parts  based  on  geometric  criteria. 

Even  if  segmentation  can  be  performed  correctly, 
conventional  segmentation  techniques  are  generally  slow. 
For  example,  detecting  a  long  straight  line  segment  by 
conventional methods requires a number of computational
steps on the order of the segment length, which
may be hundreds of pixels. Similar remarks apply to segmentation
tasks involving more general types of regions,
region boundaries, or curves.

b)  Measuring properties and relations

For  a  three-dimensional  object,  the  properties  of  and 
relations between projected images of object parts provide
only partial information about the three-dimensional
properties  of  and  relations  between  the  corresponding 
object  parts,  and  thus  are  of  limited  value  in  object 
recognition. 

Even  if  the  object  is  two-dimensional,  many  of  the 
parts  found  in  the  image  will  be  “noise”;  for  example,  an 
object  part  may  be  represented  by  two  or  more  image 
regions,  or  a  region  may  represent  a  fusion  of  two  or 
more  object  parts.  Thus  the  properties  of  many  of  the 
image  parts  will  be  meaningless;  and  the  relations 
between pairs of parts are even less likely to be meaningful.
(For example, if half the parts are incorrect, three-quarters
of the pairs of parts are likely to be incorrect.)
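
The three-quarters figure follows if each part is assumed to be incorrect independently of the others: a pair is usable only when both of its parts are correct, which happens with probability (1/2)(1/2) = 1/4. The short sketch below generalizes this arithmetic to relations among k parts; the function name and the independence assumption are introduced here for illustration only.

```python
from fractions import Fraction

def fraction_of_bad_tuples(bad_part_fraction, k=2):
    """Fraction of k-tuples of parts that contain at least one incorrect part,
    assuming each part is incorrect independently with the given probability."""
    good = 1 - Fraction(bad_part_fraction)
    return 1 - good ** k

pairs = fraction_of_bad_tuples(Fraction(1, 2), k=2)     # relations between pairs
triples = fraction_of_bad_tuples(Fraction(1, 2), k=3)   # relations among triples
print(pairs, triples)
```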

Even  in  cases  where  the  properties  are  meaningful, 
conventional  techniques  for  computing  their  values  are 
relatively  slow.  For  example,  computing  arc  length  or 
region  area  usually  requires  a  number  of  computational 
steps on the order of the quantity being measured.

c)  Finding configurations that satisfy the constraints

Finding  sets  of  image  parts  that  could  represent  a 
given three-dimensional object involves a process of constraint
intersection. For each image part, and each
object  part  that  could  have  given  rise  to  it,  the  position 
and  orientation  of  the  object  must  satisfy  certain 
geometric  constraints.  If  these  constraints  have  a 
nonempty  intersection  for  a  set  of  image  parts  and 
corresponding  object  parts,  we  have  evidence  that  the 
object  is  in  fact  present.  This  method  works  best  when 
the geometry of the object is accurately known (e.g., it is
a  manufactured  object),  so  that  the  projections  of  the 
object  onto  the  image  plane  can  be  accurately  predicted. 
It  is  not  clear  how  well  the  method  works  if  the  object 
belongs  to  a  highly  variable  class  (e.g.,  it  is  an  animal  or 
plant  of  a  given  type). 




Even  if  the  object  is  two-dimensional,  finding  it  in  a 
noisy  image  may  require  extensive  search  among  the 
image  parts,  especially  if  the  image  parts  do  not  always 
correspond  to  object  parts  due  to  segmentation  errors. 
The problem can be regarded as one of finding an isomorphic
copy of a given (small) labeled graph as a subgraph
of a (large) labeled graph (possibly after some
merging  or  splitting  of  nodes);  here  the  graph  nodes  are 
the  image  parts,  the  node  labels  are  property  values,  and 
the  arc  labels  are  relation  values.  Techniques  for  finding 
given  subgraphs  in  noisy  graphs  have  been  developed,  but 
they  are  limited  in  the  amount  of  noise  they  can  handle. 
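
To make the graph formulation concrete, the following sketch searches exhaustively for copies of a small labeled model graph in a larger labeled image graph. It is a deliberately naive illustration, not one of the published techniques referred to above: the data layout, the tolerance `tol`, and the function name `find_matches` are assumptions, and merging or splitting of nodes is not handled.

```python
from itertools import permutations

def find_matches(model_nodes, model_arcs, image_nodes, image_arcs, tol=0.1):
    """Return all assignments of model nodes to distinct image nodes whose
    node labels (property values) and arc labels (relation values) agree
    within tol. Nodes are lists of numbers; arcs map (i, j) to a number."""
    matches = []
    for assignment in permutations(range(len(image_nodes)), len(model_nodes)):
        nodes_ok = all(
            abs(model_nodes[i] - image_nodes[a]) <= tol
            for i, a in enumerate(assignment)
        )
        if not nodes_ok:
            continue
        arcs_ok = all(
            (assignment[i], assignment[j]) in image_arcs
            and abs(value - image_arcs[(assignment[i], assignment[j])]) <= tol
            for (i, j), value in model_arcs.items()
        )
        if arcs_ok:
            matches.append(assignment)
    return matches

# Hypothetical example: two model parts (lengths) joined by one relation (angle).
model_nodes = [1.0, 0.5]
model_arcs = {(0, 1): 1.05}
image_nodes = [0.3, 1.05, 0.48, 1.0]
image_arcs = {(0, 1): 1.05, (1, 2): 1.0, (3, 2): 2.0}
matches = find_matches(model_nodes, model_arcs, image_nodes, image_arcs)
print(matches)
```

The number of candidate assignments grows rapidly with the number of image parts, which is one way to see why this search becomes expensive in noisy images.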

If  we  want  to  be  able  to  recognize  objects  of  many 
different  types  in  the  image,  the  classical  approach 
requires  us  to  search  for  them  one  by  one,  since  we  do 
not  know  in  advance  which  ones  may  be  present.  If  the 
number  of  object  types  is  large,  this  sequential  search 
process  is  too  slow.  In  principle  it  can  be  speeded  up  by 
hierarchically  “indexing”  the  objects  so  that,  by  applying 
tests of increasing complexity, we can successively eliminate
classes of objects; but it is not clear how to do this
in practice for a widely diverse collection of highly variable
object types.

3.  SOME  CONJECTURES  ABOUT  OBJECT 
DESCRIPTION 

We  have  seen  that  if  the  traditional  paradigms  are 
used,  it  is  difficult  to  characterize  real-world  classes  of 
objects  explicitly  or  to  recognize  them  in  an  image. 
Humans,  however,  can  rapidly  and  reliably  recognize  such 
objects.  This  suggests  that  humans  use  simplified 
methods  of  characterizing  classes  of  objects,  designed  so 
that  the  constraints  defining  the  classes  can  be  quickly 
checked  when  an  image  is  presented. 

This  section  proposes  a  set  of  conjectures  about  the 
methods  that  humans  may  use,  when  performing  rapid 
object  recognition,  to  describe  objects  and  to  characterize 
classes  of  objects.  As  in  Section  2,  we  will  emphasize  the 
geometrical  level  of  description,  and  ignore  shading,  color, 
and  texture.  Many  facts  about  the  human  visual  system 
will  also  be  ignored,  for  simplicity;  for  example,  we  do 
not  take  into  account  the  falloff  of  resolution  away  from 
the  center  of  the  field  of  view,  nor  the  fact  that  proper¬ 
ties  are  often  not  computed  veridically  (illusions). 

Conjecture  I 

For purposes of rapid recognition, humans represent a
three-dimensional  object  by  a  set  of  characteristic  views 
or  "aspects",  i.e.  by  a  set  of  commonly  occurring  two- 
dimensional  projections. 

In  effect,  this  reduces  the  problem  of  rapid  three- 
dimensional  object  recognition  to  a  set  of  two- 
dimensional  problems.  The  number  of  aspects  needed 
could  in  principle  be  quite  large,  but  ordinarily  object 
orientations  are  quite  constrained.  We  tend  to  visualize 
familiar objects as seen from one of a few standard
viewpoints; probably we recognize them rapidly only
when  seen  from  approximately  those  viewpoints.  Recog¬ 
nition  from  other  viewpoints  may  require  “mental  rota¬ 
tion”,  which  is  reported  to  take  on  the  order  of  a  second 
or  longer. 

The  constraints  on  part  properties  and  relations  that 
characterize  an  object  must  be  quite  loose  if  the  object  is 
to  be  recognized  from  a  viewpoint  that  is  known  only 
approximately,  since  most  property  and  relation  values 
vary  under  perspective  transformations.  In  fact,  it  has 
been  found  that  the  human  observer  is  a  “sloppy  geome¬ 
ter”  [3],  and  “recognizes”  objects  even  if  their  representa¬ 
tions  in  an  image  are  not  geometrically  correct. 

The  parts  themselves  must  also  remain  fairly  stable 
under changes in viewpoint. When a change causes
major  parts  to  appear  or  disappear,  merge  or  split,  it  has 
given  rise  to  a  new  aspect.  Minor  changes  give  rise  to 
“subaspects”;  hierarchical  indexing  of  the  aspects  would 
probably  be  a  useful  aid  to  rapid  recognition. 

Conjecture  2 

To  a  first  approximation,  humans  describe  the  image  of 
an  object  (as  seen  from  a  given  aspect)  as  consisting  of  a 
set of “primitive” parts. There are two types of such
parts:  pieces  of  regions  and  pieces  of  boundaries. 

A  primitive  piece  of  boundary  is  a  maximal  boun¬ 
dary  arc  having  a  simple  shape,  e.g.  straight  (i.e.,  a 
“side”),  convex,  concave,  etc.  (cf.  the  “codons”  of  [4]).  A 
primitive  piece  of  region  is  a  maximal  subregion  having 
approximate  central  symmetry  (i.e.,  a  blob),  or  a  simple¬ 
shaped  local  symmetry  axis  and  a  width  function  of  a 
simple form (i.e., a ribbon; cf. the “smoothed local symmetries”
of [5]). We will not try to precisely define the
class  of  primitive  parts  here;  they  are  not  necessarily  the 
same  as  codons  or  smoothed  local  symmetries. 

We  conjecture  that  the  primitive  parts  are  the  per¬ 
ceptually  salient  parts  of  an  object,  in  the  sense  that 
object  descriptions  are  formulated  in  terms  of  the  proper¬ 
ties  of  and  relations  between  these  parts.  Note  that 
primitive  parts  are  not  always  unambiguously  defined;  it 
may  be  unclear,  for  example,  whether  a  nearly  straight 
arc  should  be  regarded  as  a  single  boundary  piece  or  as  a 
concatenation  of  two  pieces. 

Even  in  the  simplest  cases,  the  parts  themselves  have 
parts,  e.g.  endpoints  (for  example,  a  corner  is  an  endpoint 
of  two  boundary  segments  and  of  a  symmetry  axis  seg¬ 
ment),  and  these  subparts  too  are  perceptually  salient. 
Similarly,  a  blob  has  a  boundary,  and  a  ribbon  has 
“sides”  and  “ends”.  In  fact,  the  parts  are  not  truly 
primitive; they are consistent configurations of subparts.

Parts  can  be  defined  at  different  scales.  For  exam¬ 
ple,  a  wiggly  edge  is  regarded  as  consisting  of  many  short 
segments  at  a  fine  scale,  but  may  be  regarded  as  a  single 
“straight”  edge  at  a  coarser  scale.  Similarly,  a  symmetry 
axis  may  have  many  small  branches  at  a  fine  scale,  but 
may be regarded as a simple arc at a coarser scale; or a
string of small blobs may form a dotted arc at a coarse
scale.  The  parts  that  play  dominant  roles  in  the  descrip¬ 
tion  of  an  object  are  likely  to  be  those  that  are  defined  at 
relatively  large  scales,  comparable  to  the  object  size. 
However,  this  is  not  simply  a  matter  of  resolution;  small 
parts  are  perceptually  conspicuous  if  they  are  isolated. 

Conjecture  3 

The  properties  used  by  humans  to  describe  parts  for  pur¬ 
poses  of  rapid  recognition  are  local  property  values,  or 
simple  combinations  of  such  values. 

Such  properties  might  include  position  (of  an  end¬ 
point  or  a  centroid);  arc  length  (of  a  boundary  or  sym¬ 
metry  axis  segment);  average  radius  (of  a  blob)  or  width 
(of  a  ribbon);  area;  slope  (at  an  endpoint),  average  slope, 
or  slope  of  principal  axis;  average  absolute  curvature; 
number  of  local  extrema  or  zero-crossings  of  curvature; 
etc.  [We  will  not  discuss  here  the  nature  of  the 
coordinate  frame  with  respect  to  which  position  and  slope 
are  defined.]  Maxima  or  minima  rather  than  sums  or 
averages  can  also  be  useful;  for  example,  position  extrema 
define a “box” within which the part is contained. This
is not meant to be an exhaustive list, but it indicates the
variety of global properties that can be computed by
combining local property values in simple ways, e.g. by
summation. Other properties can be defined as combinations
of these, e.g. the “shape factor” of a blob
(area/perimeter²) or the elongatedness of a ribbon
(length/width). Still other properties of parts can be
defined in terms of the absence of other, undesired parts;
for example, a boundary is smooth if it has no corners.
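As a rough illustration of how such properties arise from sums and extrema of local values, the sketch below computes a few of them for a region given as a set of pixel coordinates. The pixel-set representation, the 4-neighbor perimeter count, and the function name are assumptions made for this example, not definitions taken from the paper.

```python
def region_properties(pixels):
    """Compute simple global properties of a region from local pixel data.

    pixels: iterable of (row, col) coordinates belonging to the region.
    Every result is a sum, an extremum, or a simple combination of these.
    """
    pixels = set(pixels)
    area = len(pixels)  # sum of 1 over the region's pixels
    # Perimeter: count pixel edges whose 4-neighbor lies outside the region.
    perimeter = sum(
        (r + dr, c + dc) not in pixels
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return {
        "area": area,
        "perimeter": perimeter,
        "centroid": (sum(rows) / area, sum(cols) / area),
        # Position extrema define the enclosing "box".
        "box": (min(rows), min(cols), max(rows), max(cols)),
        # Combination of two summed properties.
        "shape_factor": area / perimeter ** 2,
    }

# Hypothetical example: a 2 x 3 rectangle of pixels.
rect = {(r, c) for r in range(2) for c in range(3)}
props = region_properties(rect)
```

Because each quantity is a sum or an extremum over pixels, it can be accumulated piecewise and merged, which is what makes such properties suitable for divide-and-conquer computation.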

The  topological  property  of  connectedness  cannot  be 
defined  by  combining  local  property  values.  On  the  other 
hand,  it  is  well  known  that  the  connectedness  of  a  region 
is  not  perceived  immediately  unless  the  region  has  a 
sufficiently  simple  shape.  We  conjecture  that  non¬ 
connectedness  is  perceived  immediately  only  when  the 
parts  are  clearly  separated,  e.g.  when  they  are  contained 
in disjoint “boxes”.


Conjecture  4 

The  relations  used  by  humans  to  describe  combinations  of 
parts  for  purposes  of  rapid  recognition  are  defined  in 
terms of relative values of properties.

For  example,  we  can  specify  that  two  parts  are 
(approximately)  equal  (with  respect  to  a  particular  pro¬ 
perty  value),  that  one  is  greater  than  the  other,  etc. 
Note  that  since  position  is  a  property,  this  allows  rela¬ 
tions  of  relative  position,  such  as  distance,  as  well  as 
more  qualitative  relations  of  direction  and  distance  such 
as  above/below,  left/right,  near/far,  etc.  Relative  values 
of  relative  property  values  can  also  be  used,  e.g. 
nearer/farther,  which  involve  relative  distances,  or 
“between”,  which  involves  relative  directions. 
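The following sketch shows one possible way to express such relations purely in terms of relative position values. The coordinate convention (y axis pointing up), the 30° tolerance, and all function names are assumptions introduced for illustration.

```python
import math

def above(p, q):
    """p is above q; compares a single coordinate (y axis pointing up)."""
    return p[1] > q[1]

def nearer(p, q, ref):
    """p is nearer to ref than q is; a relative value of relative positions."""
    return math.dist(p, ref) < math.dist(q, ref)

def between(b, a, c, max_dev_deg=30.0):
    """b lies roughly between a and c.

    Uses relative directions: the directions from b to a and from b to c
    should be approximately opposite (within max_dev_deg of 180 degrees).
    """
    ux, uy = a[0] - b[0], a[1] - b[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    # Unsigned angle between the two direction vectors, in [0, 180].
    angle = math.degrees(math.atan2(abs(ux * vy - uy * vx), ux * vx + uy * vy))
    return 180.0 - angle <= max_dev_deg
```

For instance, `between((1, 0.1), (0, 0), (2, 0))` holds, while a point well off the line joining the other two, such as `(1, 3)`, does not satisfy the relation.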

Some  relations  between  parts  cannot  easily  be 
expressed in terms of relative property values. Examples
are adjacency (of two regions), tangency (of two boundaries),
or crossing (of two curves). Note, however, that
these  relations  can  usually  be  associated  with  the  pres¬ 
ence  or  absence  of  other  perceptually  salient  parts — e.g., 
if  two  regions  fail  to  be  adjacent  along  some  segment  of 
their  boundary,  a  ribbonlike  gap  must  exist  between 
them,  while  if  two  curves  cross  or  touch,  angles  are 
created  at  their  intersection  point.  We  conjecture  that 
relations  such  as  adjacency  are  not  perceived  immedi¬ 
ately,  and  that  nonadjacency  is  perceived  immediately 
only  when  it  gives  rise  to  a  perceptually  conspicuous  gap. 

The  topological  relation  of  surroundedness  (i.e.,  the 
fact  that  one  part  is  inside  another)  is  also  not  always 
perceived  immediately.  We  conjecture  that  non- 
surroundedness  is  perceived  immediately  only  when  there 
is  a  clear  line  of  sight  (e.g.,  when  some  perceptually 
salient  point  of  the  inner  object  has  no  point  of  the  outer 
object  to  its  right),  which  is  a  relation  of  relative  posi¬ 
tion. 

Conjecture  5 

Humans  characterize  a  class  of  objects  (as  seen  from  a 
given  aspect)  using  a  set  of  simple  unidimensional  con¬ 
straints  on  individual  property  and  relative  property 
values;  they  do  not  use  multidimensional  constraints. 

To  give  a  very  simple  illustration  of  this  idea,  we 
consider  a  block  capital  L,  since  it  is  even  simpler  than 
an  A.  A  configuration  of  two  straight  line  segments  looks 
like  an  L  if  one  is  approximately  vertical,  the  other 
approximately  horizontal,  and  the  lower  endpoint  of  the 
vertical  segment  approximately  coincides  with  the  left 
endpoint  of  the  horizontal  segment.  In  addition,  the  ratio 
of  the  lengths  of  the  segments  should  lie  in  the  proper 
range,  with  the  vertical  somewhat  longer  than  the  hor¬ 
izontal. 

Each  of  these  constraints  can  be  expressed  as  a  toler¬ 
ance  interval  around  an  ideal  property  or  relative  pro¬ 
perty  value;  or,  more  generally,  it  can  be  expressed  as  a 
membership  function  from  the  set  of  values  into  [0,1]. 
For  example,  the  membership  function  for  “approxi¬ 
mately  horizontal”  is  defined  on  the  property  of  slope  and 
has  a  peak  around  0°.  (The  exact  size  and  shape  of  this 
peak  need  not  concern  us  here.)  Similarly,  the  member¬ 
ship  function  for  the  ratio  of  the  segment  lengths 
(vertical:horizontal) has a peak in some range above 1.
The membership function for “approximately coincides”
can  also  be  defined  as  a  ratio,  where  the  numerator  is  the 
distance  between  the  endpoints  (i.e.,  the  difference 
between  their  positions)  and  the  denominator  is  the  sum 
(or  max)  of  the  segment  lengths.  [Note  that  for  some 
properties  (position,  slope)  it  is  appropriate  to  define 
“relative”  in  terms  of  differences,  while  for  others  (length) 
it  is  appropriate  to  use  ratios.] 

The  constraints  on  an  L  listed  above  are  not 
independent. The slopes of the two segments cannot simply
be chosen independently, with one approximately horizontal
and the other approximately vertical; in addition,
the slopes must be such that the segments are approximately
perpendicular. In a classical pattern recognition
framework,  this  would  be  expressed  by  a  bivariate 
membership  function  (defined  on  the  pair  of  slopes)  of  an 
appropriate  form,  with  the  membership  value  dropping 
off  as  the  individual  slopes  depart  from  their  ideal  values 
(0°  and  90°)  and  also  as  their  difference  departs  from  90°. 
We  conjecture  that  humans  do  not  use  such  bivariate  con¬ 
straints,  but  rather  use  sets  of  redundant  univariate  con¬ 
straints.  In  our  example,  in  addition  to  the  constraints  of 
approximate  horizontality  and  verticality  on  the  indivi¬ 
dual  segments,  we  would  impose  a  constraint  of  approxi¬ 
mate  perpendicularity,  defined  by  a  membership  function 
on  the  difference  of  slopes,  peaked  around  90°.  We  con¬ 
jecture  that  these  three  constraints  are  treated  by 
humans  as  though  they  were  independent,  though  in  fact 
they  are  not;  and  that  overall  membership  is  measured  by 
“intersecting”  them  (i.e.,  by  taking  the  min  of  their 
membership  values).  Thus  in  characterizing  objects  we 
use  a  higher  number  of  dimensions  than  necessary  (here: 
three,  even  though  there  are  only  two  slopes);  we  treat 
the  dimensions  as  independent  even  though  they  are 
mathematically  related,  and  we  treat  them  all  alike  even 
though  some  of  them  can  be  regarded  as  more  basic  than 
others  (e.g.,  the  slopes  are  basic  dimensions,  while  their 
difference is a “derived” dimension). We do not map the
constraints  on  the  redundant  dimensions  into  a  subspace 
having  only  the  minimal  necessary  number  of  dimensions, 
as  would  be  done  in  the  classical  approach. 

The  constraints  used  in  our  example  have  all  been  of 
a  very  simple  form,  defined  by  membership  functions 
having  single  peaks.  We  conjecture  that  humans  gen¬ 
erally  use  only  such  unimodal  membership  functions.  If  a 
membership  function  is  strongly  bimodal,  humans  regard 
the  object  class  as  consisting  of  two  subclasses,  each  hav¬ 
ing  a  unimodal  membership.  Functions  of  complex  form 
are  not  used;  the  basic  form  of  a  peak  is  a  plateau  having 
simple-shaped  dropoffs. 
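The L example can be sketched as a set of redundant univariate constraints combined by taking the minimum. In this sketch the plateau shape with linear dropoffs, all numeric ranges, and the function names are illustrative assumptions; the paper deliberately leaves the exact size and shape of the peaks unspecified.

```python
import math

def plateau(x, lo, hi, falloff):
    """Unimodal membership: 1 on [lo, hi], dropping linearly to 0 over `falloff`."""
    if lo <= x <= hi:
        return 1.0
    d = lo - x if x < lo else x - hi
    return max(0.0, 1.0 - d / falloff)

def slope_deg(seg):
    """Slope of an undirected segment in degrees, in [0, 180)."""
    (x0, y0), (x1, y1) = seg
    return math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0

def length(seg):
    return math.dist(seg[0], seg[1])

def l_membership(vertical, horizontal):
    """Membership of two segments in the class "block capital L" (y axis up)."""
    sv, sh = slope_deg(vertical), slope_deg(horizontal)
    lower_end = min(vertical, key=lambda p: p[1])
    left_end = min(horizontal, key=lambda p: p[0])
    gap = math.dist(lower_end, left_end) / max(length(vertical), length(horizontal))
    constraints = [
        plateau(sv, 80.0, 100.0, 20.0),                  # approximately vertical
        plateau(min(sh, 180.0 - sh), 0.0, 10.0, 20.0),   # approximately horizontal
        plateau(abs(sv - sh), 80.0, 100.0, 20.0),        # approximately perpendicular
        plateau(length(vertical) / length(horizontal), 1.2, 2.5, 0.5),  # length ratio
        plateau(gap, 0.0, 0.1, 0.2),                     # endpoints approximately coincide
    ]
    # Constraints are treated as independent and "intersected" by taking the min.
    return min(constraints)

# Hypothetical examples: a clean L, a slanted L, and two parallel segments.
clean = l_membership(((0, 0), (0, 2)), ((0, 0), (1.2, 0)))
slanted = l_membership(((0, 0), (1, 2)), ((0, 0), (1.2, 0)))
parallel = l_membership(((0, 0), (2, 0)), ((0, 1), (1.2, 1)))
```

Note that the perpendicularity constraint is redundant given the two slope constraints, yet it is checked as a separate dimension, which mirrors the conjecture that no bivariate membership function is used.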

4.  AN  APPROACH  TO  OBJECT 
RECOGNITION 

We  have  conjectured  in  Section  3  that  for  purposes 
of  rapid  recognition,  humans  use  a  simple  class  of  object 
descriptions,  based  on  typical  aspects,  perceptually  salient 
parts,  and  simple  property  and  relative  property  values, 
and  that  they  characterize  object  types  by  imposing  sim¬ 
ple  univariate  constraints  on  these  values.  We  now  sug¬ 
gest  how  objects  characterized  in  this  way  can  be  rapidly 
recognized  using  suitable  parallel  computational  tech¬ 
niques. 

Our  approach  consists  of  three  stages: 

(1)  The  image  is  input  to  an  array  of  processors  hav¬ 
ing  an  appropriate  type  of  connectivity;  two 
extensively  used  examples  of  such  arrays  are  a 
hypercube  and  an  exponentially  tapering  pyramid. 
This  processor  array  segments  perceptually  salient 
parts from the image and measures their property
values in time proportional to the logarithm of the
image  size  (the  dimension  of  the  hypercube  or  the 
height  of  the  pyramid). 

(2)  The  part  properties  are  broadcast  to  another  set 
of  processors  that  contain  object  characterizations. 

(3)  Each  of  these  “object  processors”  computes  the 
appropriate  relative  property  values  and  checks 
whether  the  constraints  defining  its  object  type 
are  satisfied. 

Further  details  about  these  stages  will  be  given  below. 
Note,  meanwhile,  that  at  stage  (1),  by  using  hypercube- 
or  pyramid-connected  hardware  operating  in  parallel,  seg¬ 
mentation  and  property  measurement  are  performed  very 
rapidly.  The  broadcasting  at  stage  (2)  is  basically 
sequential,  but  we  assume  that  the  properties  of  only  the 
most  conspicuous  image  parts  are  broadcast,  so  that  the 
total  amount  of  data  broadcast  is  limited.  Finally,  at 
stage  (3)  the  part  descriptions  (property  values)  extracted 
from  the  image  are  checked  in  parallel  against  the  entire 
set  of  object  characterizations.  [In  terms  of  subgraph  iso¬ 
morphism,  here  the  subgraphs  all  check  the  data 
extracted  from  the  graph,  in  parallel,  looking  for  copies  of 
themselves;  this  is  the  reverse  of  the  traditional  procedure 
in which the graph checks itself for copies of the subgraphs.]

4.1.  Part  segmentation  and  property  value  com¬ 
putation 

Segmentation  of  arbitrary  (complex-shaped)  regions 
or  boundaries  from  an  image  seems  to  require  computa¬ 
tion  time  on  the  order  of  the  region  diameter  or  boundary 
length,  even  if  implemented  on  parallel  hardware.  If  we 
restrict  ourselves  to  “primitive”  (i.e.,  simple-shaped) 
boundary  arcs  or  regions,  however,  such  as  sides,  blobs, 
or  ribbons,  divide-and-conquer  techniques  can  be  used  to 
segment  them  from  the  image  in  times  on  the  order  of 
the  log  of  the  diameter  or  length,  using  appropriate  paral¬ 
lel  hardware.  Techniques  for  performing  these  types  of 
fast  segmentation  operations  on  an  image  are  under 
intensive  study  in  our  laboratory;  for  a  recent  review  of 
this work see [6].

We assume that the image, say of size $2^n \times 2^n$, is
input to a square array of processors (“cells”, for short),
one pixel per cell. In a hypercube, each cell is connected
to the cells in its row and column of the array at distances
$1, 2, 4, \ldots, 2^{n-1}$ (modulo $2^n$). In a pyramid, we
have $n$ additional arrays of cells of sizes $2^{n-1} \times 2^{n-1}$,
$2^{n-2} \times 2^{n-2}$, $\ldots$, $2 \times 2$, and $1 \times 1$, one above the other,
and each cell in a given array is connected to a block of
cells in the array below it. The techniques discussed in
[6] assume a pyramid of cells, but they have straightforward
analogs in the hypercube case.
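The two connectivity schemes can be made concrete with a small sketch. Two details are assumptions rather than statements from the paper: links are taken in both directions along a row or column, and each pyramid cell is connected to a non-overlapping 2 × 2 block of cells below it (the text only says "a block"). The function names are hypothetical.

```python
def hypercube_neighbors(row, col, n):
    """Cells linked to (row, col) in a 2^n x 2^n array.

    Links run along the row and along the column at distances
    1, 2, 4, ..., 2^(n-1), taken modulo 2^n (both directions assumed).
    """
    size = 1 << n
    neighbors = set()
    for k in range(n):
        d = 1 << k
        for step in (d, -d):
            neighbors.add((row, (col + step) % size))
            neighbors.add(((row + step) % size, col))
    return neighbors

def pyramid_level_sizes(n):
    """Side lengths of the arrays from the base (2^n) up to the apex (1)."""
    return [1 << (n - level) for level in range(n + 1)]

def pyramid_children(level, row, col):
    """Assumed 2 x 2 block of cells on the level below that feeds this cell."""
    return [(level - 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]
```

For `n = 3` the base is 8 × 8, the level sizes are 8, 4, 2, 1, and each cell has ten distinct hypercube neighbors under these assumptions (distances +4 and -4 coincide modulo 8).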

Our  techniques  segment  perceptually  salient  parts 
from  the  image  by  constructing  trees  of  pyramid  cells  in 
which  the  root  cell  of  a  tree  represents  the  given  part  and 
the  leaf  cells  (in  the  base  of  the  pyramid)  correspond  to 
the  pixels  that  belong  to  the  part.  Since  the  pyramid 
tapers exponentially, the height of a tree (and hence the
time needed to construct it) is proportional to the logarithm
of the size of the part. Parts that are large or isolated
are represented by roots high up in the pyramid,
while  small,  non-isolated  parts  correspond  to  roots  lower 
in  the  pyramid.  Thus  our  techniques  are  able  to  rapidly 
find  perceptually  salient  parts  in  the  image,  and  to 
represent  them  by  roots  whose  heights  in  the  pyramid  are 
related  to  their  conspicuousness. 

We  will  not  discuss  here  in  detail  the  types  of  parts 
that can be segmented from an image using these techniques,
but it is suggested in [6] that they include “groupings”
of pixels (having given local properties) that satisfy
the  Gestalt  “laws”  of  similarity,  proximity,  good  con¬ 
tinuation,  and  closure.  It  is  widely  agreed  among 
psychologists  that  in  human  perception,  such  parts  are 
generated  by  “preattentive”  processes,  and  are  then  used 
in the process of object recognition. [The recognition
process  too  must  be  regarded  as  “preattentive”  when  we 
are  dealing  with  unexpected  objects  and  are  able  to 
recognize  them  “at  a  glance”.  On  the  other  hand,  the 
grouping  processes  are  probably  innate,  whereas  charac¬ 
terizations  of  objects  must  be  learned.] 

If  an  image  part  is  represented  by  a  tree  of  pyramid 
cells,  properties  of  the  part  defined  by  sums,  maxima,  etc. 
of  local  property  values  can  be  quickly  computed  by  the 
cells  in  the  tree,  using  simple  divide-and-conquer  tech¬ 
niques,  so  that  the  results  are  available  at  the  root  of  the 
tree  in  time  proportional  to  the  tree  height,  i.e.  to  the 
logarithm  of  the  part  size. 
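A simplified sequential simulation of this divide-and-conquer computation is sketched below. It reduces a whole $2^n \times 2^n$ base array through non-overlapping 2 × 2 blocks, whereas the paper's trees are built per part; the block structure, the function name, and the example data are assumptions. The step counter shows that the number of (conceptually parallel) combining steps equals the pyramid height.

```python
def pyramid_reduce(base, combine):
    """Combine local values level by level through 2 x 2 blocks.

    base:    square list of lists whose side is a power of two
    combine: function applied to the four child values (e.g. sum or max)
    Returns (value at the root, number of levels climbed).
    """
    side = len(base)
    assert side > 0 and side & (side - 1) == 0, "side must be a power of two"
    assert all(len(row) == side for row in base), "base must be square"
    level = [list(row) for row in base]
    steps = 0
    while len(level) > 1:
        half = len(level) // 2
        # Each parent cell combines its four children; in hardware all
        # parents on a level would do this simultaneously.
        level = [
            [combine([level[2 * r][2 * c], level[2 * r][2 * c + 1],
                      level[2 * r + 1][2 * c], level[2 * r + 1][2 * c + 1]])
             for c in range(half)]
            for r in range(half)
        ]
        steps += 1
    return level[0][0], steps

# Hypothetical example: a 4 x 4 mask marking the pixels of one part.
mask = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
area, steps = pyramid_reduce(mask, sum)
# Rightmost column of the part: a position extremum (-1 marks non-part pixels).
cols = [[c if v else -1 for c, v in enumerate(row)] for row in mask]
rightmost, _ = pyramid_reduce(cols, max)
```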

A  cell  high  up  in  the  pyramid  obtains  input  data 
from  a  large  block  of  the  image.  Such  a  block  may  con¬ 
tain  or  intersect  many  image  parts;  many  blob-like  parts 
may  be  contained  in  it,  and  many  arc-like  or  ribbon-like 
parts  may  pass  through  it.  We  assume  that  the  capacity 
of  a  cell  to  represent  (pieces  of)  parts  is  limited,  and  that 
if  this  capacity  is  exceeded,  the  cell  preserves  only  statist¬ 
ical  information  about  the  properties  of  the  set  of  parts 
that  its  image  block  contains.  We  further  assume  that 
the  cell  tries  to  group  its  set  of  parts  into  subsets  having 
unimodally  distributed  property  values.  In  particular,  if 
one  of  the  parts  differs  sufficiently  from  all  the  others,  the 
cell  preserves  information  about  the  unique  part  while 
statistically  summarizing  the  data  about  the  others. 
Thus  unique  image  parts,  even  if  they  are  not  isolated, 
are  represented  by  root  cells  relatively  high  in  the 
pyramid. 
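One way to picture the limited-capacity assumption is the sketch below, which handles a single scalar property. The capacity value, the median-based test for a part that "differs sufficiently", and the function name are all assumptions introduced here; the paper does not specify how the grouping is done.

```python
from statistics import mean, median

def summarize_cell(values, capacity=3, outlier_gap=3.0):
    """Represent the property values of the parts seen by one pyramid cell.

    If the cell's capacity is not exceeded, every value is kept.
    Otherwise, values far from the median (relative to the median absolute
    deviation) are kept individually as "unique" parts, and the rest are
    replaced by a statistical summary.
    """
    values = list(values)
    if len(values) <= capacity:
        return {"individual": values, "summary": None}
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # avoid division by zero
    unique = [v for v in values if abs(v - med) / mad > outlier_gap]
    rest = [v for v in values if abs(v - med) / mad <= outlier_gap]
    summary = {"count": len(rest), "mean": mean(rest),
               "min": min(rest), "max": max(rest)} if rest else None
    return {"individual": unique, "summary": summary}

# Hypothetical example: five similar parts and one that stands out.
result = summarize_cell([10, 11, 9, 10, 12, 50])
```

In this example the value 50 survives as an individual entry while the other five values are collapsed into a count, mean, and range, which corresponds to a unique part being passed upward even though it is not isolated.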

4.2.  Broadcasting 

We  postulate  a  “readout”  process  that  scans  the 
upper  levels  of  the  pyramid  and  broadcasts  the  set  of  pro¬ 
perty  values  (or  statistical  summaries  of  property  values) 
stored  at  each  cell  to  the  object  processors  (i.e.,  the  pro¬ 
cessors  that  contain  characterizations  of  objects). 
According  to  the  principles  sketched  in  Section  4.1,  the 
cells  on  the  upper  levels  of  the  pyramid  contain  statistical 
summaries  of  the  properties  of  small  image  parts,  as  well 
as  properties  of  individual  parts  that  are  large,  isolated, 
or unique. These latter are the parts that are perceptually
conspicuous, i.e. whose presence in the image is
noticed “immediately”.

Our  assumption  that  the  readout  process  starts  with 
the  upper  levels  of  the  pyramid  applies  to  situations  in 
which  there  are  no  prior  expectations.  Presumably,  prior 
knowledge  can  be  used  to  control  the  order  of  readout,  as 
regards  types  of  parts,  sizes,  positions,  etc.;  but  goal- 
directed  processing  will  not  be  discussed  in  this  paper. 

4.3.  Constraint  checking 

When  an  object  processor  receives  the  broadcast 
data,  it  computes  (for  each  possible  association  of  the 
image  parts  with  object  parts)  the  necessary  relative  pro¬ 
perty  values  and  checks  whether  the  appropriate  con¬ 
straints  are  satisfied.  Note  that  relative  property  values 
are  not  computed  in  the  pyramid  and  are  not  broadcast; 
this  saves  much  broadcasting  time,  and  each  object  pro¬ 
cessor  computes  only  those  relative  values  that  it  needs. 
Relations  between  parts  defined  by  the  presence  of  other 
parts  (e.g.,  perceptually  conspicuous  gaps)  are  also  deter¬ 
mined  at  this  stage.  (Relations  based  on  the  absence  of 
other  parts  cannot  be  reliably  determined,  since  the  other 
parts  may  be  present  but  their  properties  may  not  have 
been  broadcast.) 
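The constraint-checking stage might be sketched as follows. Each hypothetical `ObjectProcessor` receives the same broadcast list of part properties, tries every association of image parts with its object parts, computes only the relative values it needs, and scores an association by the minimum of its univariate memberships. The class layout, the triangular membership function, and the numeric tolerances are assumptions; the processors are looped over sequentially here, whereas the paper envisions them working in parallel.

```python
from itertools import permutations

def near(target, tol):
    """Univariate membership: 1 at `target`, falling linearly to 0 at distance `tol`."""
    return lambda x: max(0.0, 1.0 - abs(x - target) / tol)

class ObjectProcessor:
    def __init__(self, name, unary, relative):
        self.name = name
        self.unary = unary        # {object_part: {property_name: membership}}
        self.relative = relative  # [(part_a, part_b, value_fn, membership)]

    def check(self, image_parts):
        """Best membership over all associations of image parts with object parts."""
        roles = list(self.unary)
        best = 0.0
        for assigned in permutations(image_parts, len(roles)):
            bind = dict(zip(roles, assigned))
            scores = [m(bind[r][p]) for r in roles for p, m in self.unary[r].items()]
            # Relative property values are computed here, not broadcast.
            scores += [m(f(bind[a], bind[b])) for a, b, f, m in self.relative]
            best = max(best, min(scores))
        return best

# Hypothetical processor for a block capital L.
l_proc = ObjectProcessor(
    "L",
    unary={"v": {"slope": near(90, 30)}, "h": {"slope": near(0, 30)}},
    relative=[("v", "h", lambda a, b: a["length"] / b["length"], near(1.6, 1.0))],
)
# Broadcast data: two plausible strokes plus one spurious part.
broadcast = [
    {"slope": 2, "length": 1.0},
    {"slope": 88, "length": 1.5},
    {"slope": 45, "length": 0.3},
]
score = l_proc.check(broadcast)
```

The spurious 45° part lowers no score here because some association that ignores it satisfies all constraints, which matches the observation that extraneous parts need not prevent recognition.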

We  do  not  consider  here  how  the  object  processors 
are  related  to  one  another  or  how  they  are  intercon¬ 
nected.  One  can  imagine  that  they  are  organized  in  some 
type  of  whole/part  hierarchical  structure  in  which  objects 
can  have  subobjects  in  common,  so  that  the  subobjects 
need  only  be  processed  once.  On  the  other  hand,  a 
hierarchical  organization  based  on  object  classes  (e.g., 
“dog”  and  “horse”  are  subclasses  of  “mammal”)  is  less 
relevant  to  our  object  recognition  task;  the  parts  seg¬ 
mented  from  an  image  may  look  like  parts  of  a  (specific 
type  of)  dog  or  horse,  but  it  is  not  clear  what  the  parts  of 
a  general  mammal  look  like.  The  organization  of  our 
object  processors,  which  are  concerned  with  iconic 
descriptions  of  objects,  constitutes  only  one  aspect  of  the 
organization  of  semantic  memory,  which  also  involves  the 
names of the objects, their class relations and
associations, their functional roles, and so on.

Since  the  image  is  usually  noisy,  many  of  the  parts 
segmented  from  the  image  will  not  correspond  to  object 
parts.  Thus  even  if  an  object  is  present,  its  description  is 
not  likely  to  be  more  than  partially  satisfied  by  the  image 
data.  We  assume  that  an  object  is  recognized  at  first 
glance  only  if  many  of  its  parts  are  visible  in  the  image, 
but  this  is  not  the  case  for  any  other  object — i.e.,  the 
configuration  of  parts  visible  in  the  image  is  distinctive. 
(Recall  that  we  are  dealing  with  images  showing  only  a 
single  object  on  a  blank  background.) 

In  some  cases,  an  object  can  be  recognized  based  on 
seeing  only  a  few  of  its  parts,  if  they  are  sufficiently  dis¬ 
tinctive;  this  capability  is  exploited  by  caricaturists.  The 
effectiveness  of  our  approach  depends  on  the  fact  that  our 
pyramid  techniques  can  rapidly  find  global  image  parts 
that  have  rich,  distinctive  descriptions;  the  parts  are  not 
merely local features, which could be rapidly extracted
using conventional techniques.

If  enough  parts  are  visible,  we  recognize  an  object 
even  though  extraneous  parts  may  also  be  present;  for 
example,  we  can  recognize  an  object  even  when  someone 
has  scribbled  on  the  image.  On  the  other  hand,  if  the 
additional  parts  interfere  with  the  segmentation  of  the 
correct  parts — for  example,  if  the  lines  belonging  to  the 
object  are  smoothly  extended — the  object  becomes  very 
hard  to  detect. 

We  recognize  an  object  even  though  its  parts  may 
also  belong  to  (or  look  like)  other  objects.  This  seems  to 
cause  no  problem  for  rapid  recognition;  the  rivalrous  per¬ 
ceptions  can  be  resolved  in  subsequent  analysis. 

Once  an  object  is  recognized,  it  becomes  possible  to 
reanalyze  the  image  in  a  goal-directed  fashion.  This 
allows  us,  for  example,  to  modify  the  segmentation  in  an 
attempt  to  find  missing  parts  (by  changing  the  segmenta¬ 
tion  criteria  or  by  merging  or  splitting  previously  found 
parts),  to  resolve  inconsistencies,  or  to  examine  less  per¬ 
ceptually  salient  parts  which  represent  finer  details  of  the 
object.  Discussion  of  this  goal-directed  analysis  process  is 
beyond  the  scope  of  this  paper.  [It  should  be  pointed  out 
that  goal-directed  analysis  may  play  a  role  even  in  the 
perception  of  unfamiliar  objects.  If  such  an  object  is 
“recognized”  as  belonging  to  a  given  general  class  (e.g., 
the  class  of  polygons,  of  smooth  blobs,  etc.),  goal-directed 
methods  can  then  be  used  to  confirm  this  description, 
supply missing details, etc.]

5.  DISCUSSION  AND  DISCLAIMERS 

The  purpose  of  this  paper  was  to  suggest  a  computa¬ 
tional  scheme  that  might  account  for  human  performance 
in recognizing familiar but unexpected objects “at a
glance”, i.e. within roughly 100 neural “cycles”.
Assuming  that  an  object  subtends  at  least  several  degrees 
of  visual  angle,  its  retinal  image  is  hundreds  of  receptor 
cells  across,  so  that  conventional  techniques  of  segmenta¬ 
tion  and  property  measurement  would  require  hundreds 
of  computational  steps,  even  if  implemented  by  parallel 
processing,  e.g.  on  a  mesh-connected  computer.  Our 
pyramid-  (or  hypercube-)  based  approach  should  reduce 
this  time  substantially,  since  the  pyramid  height  is  only 
about  10,  and  we  can  use  divide-and-conquer  techniques 
to  find  the  needed  image  parts  and  compute  their  needed 
properties  in  tens  of  (parallel)  steps.  Only  a  limited 
amount  of  property  data  need  be  broadcast  to  the  object 
processors,  which  then  compute  the  needed  relative  pro¬ 
perty  values  and  carry  out  constraint  checking  in  parallel. 

Our  concepts  of  perceptual  saliency  for  parts,  pro¬ 
perties,  and  relations  have  been  based  on  introspection; 
we  have  not  considered  analysis  techniques  based  on 
processes  that  do  not  seem  to  be  subject  to  introspection, 
e.g.  processes  involving  image  transforms.  We  have 
assumed  (at  least  tacitly)  that  the  parts  are  extracted  in 
a  “pyramid  space”  derived  from  the  image  by  resolution 
reduction,  rather  than  in  a  transform  domain.  We  have 
also assumed that the object processors are localized,
rather than constituting some type of distributed
memory. 

We  have  not  intended  to  imply  that  there  is  a  clear 
dichotomy  between  recognition  at  a  glance  and  recogni¬ 
tion  based  on  deliberate  inspection.  Some  objects  will 
take  longer  to  recognize  because  they  are  more  easily  con¬ 
fused  with  other  objects  or  because  the  image  is  noisier. 
In  terms  of  our  proposed  approach,  it  may  sometimes  be 
necessary to obtain more data from the image (from
lower  pyramid  levels,  e.g.)  before  a  decision  can  be  made, 
or  it  may  be  necessary  to  perform  some  goal-directed  re¬ 
analysis  in  an  attempt  to  decide  among  alternative  or 
rivalrous  objects.  If  the  situation  is  noisy  enough  or 
ambiguous  enough,  prolonged  inspection  may  be  neces¬ 
sary.  [We  have  not  discussed  the  noise-resistant  proper¬ 
ties  of  our  approach,  nor  how  it  might  be  implemented 
using unreliable components.] Our approach is intended
only  to  demonstrate  how  recognition  might  take  place 
very  rapidly  in  a  sufficiently  clear-cut  situation. 

At  the  other  extreme,  if  an  object  is  viewed  for  a 
very  short  period  of  time  (e.g.,  tens  of  milliseconds),  it  is 
usually  not  possible  to  recognize  it,  but  it  may  be  possible 
to  get  a  general  impression  of  its  appearance,  e.g.  that  it 
is  compact  or  that  it  is  vertically  elongated.  In  terms  of 
our  approach,  this  may  result  from  lack  of  sufficient  time 
to  broadcast  the  properties  of  any  but  the  largest  parts, 
so  that  only  crude  shape  information  is  available. 

Recognition  in  the  absence  of  prior  expectations  or 
context  is  a  very  special  task;  normally  one  does  have 
expectations,  and  objects  do  not  occur  in  isolation.  How¬ 
ever,  this  special  task  is  an  important  one.  The  visual 
system  is  occasionally  confronted  with  surprises,  and 
must  handle  them  rapidly  and  accurately;  it  cannot 
always  depend  on  expectations.  It  should  be  recognized, 
however,  that  there  are  many  possible  levels  of  expecta¬ 
tion;  for  example,  in  some  situations  one  might  be  expect¬ 
ing  a  certain  general  class  of  objects.  In  fact,  some  types 
of  tacit  “expectations”  may  always  be  present  as  a  result 
of  cultural  factors. 

The  absence  of  expectations,  or  wrong  expectations, 
can  greatly  increase  the  difficulty  of  recognition.  For 
example,  if  we  expect  alphanumeric  characters,  we  can 
easily  interpret  odd  configurations  of  lines  as  characters, 
and  can  rapidly  distinguish  characters  from  one  another; 
but  if  we  have  no  prior  expectations,  or  if  we  are  expect¬ 
ing  pictures  of  animals,  the  configurations  may  make  no 
sense.  As  an  extreme  case  of  this,  many  oddly  designed 
typefaces  are  readable  because  we  expect  letters,  and  can 
tell  (by  elimination  or  from  context)  which  letter  is 
intended;  but  many  of  these  letters  could  not  be  correctly 
identified  in  the  absence  of  such  context.  Similarly,  a 
number  of  authors  have  illustrated  the  complexity  of  the 
task  of  recognizing  alphanumeric  characters  by  demon¬ 
strating  that  a  very  wide  variety  of  patterns  can  all  be 
recognized  as  (e.g.)  A’s;  but  many  of  these  patterns  would 
in  fact  not  be  recognized  as  A’s  if  we  did  not  know  that 
they  were  intended  to  be  A’s — in  other  words,  they  are 
examples of goal-directed recognition rather than
recognition at a glance.

Humans  have  a  remarkable  capacity  for  recognizing 
images that they have previously seen. (Hard-to-segment
images also become much easier to interpret on subsequent
occasions, once they have been successfully segmented.)
This does not necessarily involve storing and
matching  of  iconic  templates;  it  may  be  based  on  storage 
and  matching  of  specific  sets  of  relative  property  values. 
In  any  case,  rapid  recognition  of  unexpected  familiar 
objects  does  not  depend  on  having  seen  the  same  objects 
previously;  recognition  is  still  immediate  even  for 
instances  of  the  objects  that  we  have  never  seen  before. 

In  conclusion,  it  should  be  emphasized  that  this 
paper  has  not  presented  a  report  of  research  results,  but 
only  an  outline  of  a  proposed  research  program.  Some  of 
the  ideas  described  are  under  active  investigation  (e.g., 
the  pyramid  approach  to  segmentation),  but  others  are 
still  at  the  level  of  speculations  (e.g.,  the  conjectures 
about  object  characterization  and  the  parallel  object 
recognition  process).  It  is  hoped  that  the  publication  of 
this  paper  will  stimulate  exploration  of  (and  improvement 
on)  these  ideas  both  in  our  laboratory  and  elsewhere. 

REFERENCES 

1. T. Pavlidis, Structural Pattern Recognition, Springer, Berlin, 1977, Section 6.2.

2. A. Rosenfeld and A.C. Kak, Digital Picture Processing (second edition), Academic Press, New York, 1982, Section 12.2.4.

3. D.N. Perkins, Why the human perceiver is a bad machine, in J. Beck et al., eds., Human and Machine Vision, Academic Press, New York, 1983, 341-364.

4. W. Richards and D.D. Hoffman, Codon constraints on closed 2D shapes, Computer Vision, Graphics, Image Processing 31, 1985, 265-281.

5. M. Brady and M. Asada, Smoothed local symmetries and their implementation, International J. Robotics Research 3 (3), 1984, 36-61.

6. A. Rosenfeld, Some pyramid techniques for image segmentation, Technical Report 1664, Center for Automation Research, University of Maryland, May 1986.




PARALLEL  ALGORITHMS  FOR  COMPUTER  VISION  ON  THE  CONNECTION  MACHINE 


James  J.  Little,  Guy  Blelloch,  and  Todd  Cass 

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 


Abstract 

We  show  how  to  solve  a  set  of  benchmark  problems  for 
Computer  Vision  for  the  Connection  Machine.  These  prob¬ 
lems  are  intended  to  comprise  a  representative  sample  of 
fundamental  procedures  for  Image  Understanding.  The  so¬ 
lutions  on  the  Connection  Machine  embody  general  meth¬ 
ods  for  filtering  images,  determining  connectivity  among 
image elements, and determining geometry of image elements.
We identify a set of low and intermediate vision
modules, including edge detection from zero-crossings of
$\nabla^2 * G$ and from the Canny detector, connected component
labeling, Hough transforms, and element visibility. More
general  problems  such  as  graph  matching  and  shortest  path 
calculation  are  also  included.  The  implementation  of  these 
generic  modules  demonstrates  the  ability  of  the  Connec¬ 
tion  Machine  to  solve  vision  problems  easily  and  rapidly, 
using  a  variety  of  interesting  aspects  of  the  communication 
structure  of  the  machine. 


1  Introduction 

The Connection Machine is a powerful fine-grained parallel
machine which has already proven very useful for implementation
of vision algorithms. Among the existing vision
systems on the Connection Machine are binocular stereo
[Drumheller86] and optical flow detection [Little86b]. In implementing
these systems, several different models of using
the Connection Machine have emerged, since the machine
provides several different communication modes. The Connection
Machine implementation of algorithms can take advantage
of the underlying architecture of the machine in
novel ways. We describe here several common, elementary
operations which recur throughout this discussion of parallel
algorithms.

1.1  The  Connection  Machine 

The Connection Machine [Hillis85] is a parallel computing
machine having between 16K and 64K processors, operating
under a single instruction stream broadcast to all processors
(figure 1). It is a Single Instruction Multiple Data
(SIMD) machine, because all processors execute the same
control stream. Each of the processors is a simple 1-bit processor,
currently with 4K bits of memory. There are two
modes of communication among the processors: first, the
processors are connected by a mesh of wires into a 128x512
grid network (the NEWS network, so-called because of the
four cardinal directions), allowing rapid direct communication
between neighboring processors, and, second, the
router, which allows messages to be sent from any processor
to any other processor in the machine. The processors
in the Connection Machine can be envisioned as being the
vertices of a 16-dimensional hypercube (in fact, it is a 12-dimensional
hypercube; at each vertex of the hypercube
resides a chip containing 16 processors). Figure 2 shows
a 4-dimensional hypercube; each processor is connected by
4 wires to other processors. Each processor in the Connection
Machine is identified by a unique integer in the
range 0 ... 65535, its hypercube address, imposing a linear
order on the processors. This address denotes the destination
of messages handled by the router. Messages pass
along the edges of the hypercube from source processors
to destination processors. In addition to local operations
in the processors, the Connection Machine has facilities for
returning to the host machine the result of various operations
on a field in all processors; it can return the global
maximum, minimum, sum, logical AND, logical OR of the
field.

Figure 1: Block Diagram of the Connection Machine

Figure 2: 4-dimensional Hypercube

To  allow  the  machine  to  manipulate  data  structures 
with  more  than  64K  elements,  the  Connection  Machine 
supports  the  concept  of  virtual  processors.  A  single  physi¬ 
cal  processor  can  operate  as  multiple  virtual  processors  by 
serializing  operations  in  time,  and  dividing  the  memory  of 
each  processor  accordingly.  This  is  otherwise  invisible  to 
the user. The number of virtual processors assigned to a
physical processor is denoted by the virtual processor ratio
(VP ratio), which is always $\geq 1$. When the VP ratio
is greater than 1, the Connection Machine is necessarily
slowed down by that factor in most operations. The host
for  the  16K  Connection  Machine  at  the  MIT  AI  Lab  is  a 
Symbolics  3640  Lisp  Machine.  In  this  paper,  we  assume  a 
64K  Connection  Machine.  Connection  Machine  programs 
utilize  Common  Lisp  syntax,  in  a  language  called  *Lisp 
[Lasser86].  Statements  in  *Lisp  programs  are  compiled 
and  manipulated  in  the  same  fashion  as  Lisp  statements, 
contributing  significantly  to  the  ease  of  programming  the 
Connection  Machine. 

1.2  Powerful  Primitive  Operations 

Many  vision  problems  must  be  solved  by  a  combination 
of  several  communication  modes  on  the  Connection  Ma¬ 
chine.  The  design  of  these  algorithms  takes  advantage  of 
the  underlying  architecture  of  the  machine  in  novel  ways. 
There are several common, elementary operations which recur
throughout this discussion of parallel algorithms: routing
operations, scanning, and distance doubling.

1.2.1  Routing 

Memory  in  the  Connection  Machine  is  attached  to  proces¬ 
sors,  each  of  which  has,  at  present,  4K  bits  of  local  mem¬ 
ory.  Local  memory  can  be  accessed  rapidly.  Memory  of 
processors  nearby  in  the  NEWS  network  can  be  accessed 
by  passing  it  through  the  processors  on  the  path  between 
the  source  and  destination.  At  present,  NEWS  accesses  in 
the  machine  are  made  in  the  same  direction  for  all  pro¬ 
cessors,  i.e.,  to  read  in  one  processor,  using  NEWS  access, 
say,  the  North,  and,  in  the  next,  the  South,  requires  two 
separate  accesses.  The  router  on  the  Connection  Machine 
provides  parallel  reads  and  writes  among  processor  mem¬ 
ory  at  arbitrary  distances  and  with  arbitrary  patterns.  It 
uses  a  packet-switched  message  routing  scheme  to  direct 
messages along the hypercube connections to their destinations.
This powerful communication mode can be used
to reconfigure completely, in one parallel write operation,
taking one router cycle, a field of information in the machine.
The Connection Machine supplies instructions so
that processors can concurrently read and concurrently
write a memory address, but since these memory references
can cause significant slowdown, we will usually only
consider exclusive read, exclusive write (EREW) [Cook80]
instructions. We will usually not allow more than one processor
to access the memory of another processor at one
time.  However,  in  the  Hough  Transform  we  will  take  advan¬ 
tage  of  combiners  since  we  are  guaranteed  few  and  evenly 
spaced  collisions.  The  Connection  Machine  can  combine 
messages  at  a  destination,  by  various  operations,  such  as 
logical  AND,  inclusive  OR,  summation,  and  maximum  or 
minimum. 

1.2.2  Scanning 

A  powerful  set  of  primitive  operations  are  the  scan  op¬ 
erations  [Blelloch86].  These  operations  can  be  used  to  sim¬ 
plify  and  speed  up  many  algorithms.  They  directly  take  ad¬ 
vantage  of  the  hypercube  connections  underlying  the  router 
and  can  be  used  to  distribute  values  among  the  processors 
and to aggregate values using associative operators. Formally,
the scan operation takes a binary associative operator


processor-number  =  [0   1   2   3   4   5   6   7]
A                 =  [5   1   3   4   3   9   2   6]
Plus-Scan(A)      =  [0   5   6   9  13  16  25  27]
Max-Scan(A)       =  [0   5   5   5   5   5   9   9]

Figure 3: Examples of Plus-Scan and Max-Scan.




processor-number   =  [0   1   2   3   4   5   6   7]
A                  =  [5   1   3   4   3   9   2   6]
SB (segment bit)   =  [1   0   1   0   0   0   1   0]
Plus-Scan(A, SB)   =  [0   5   6   3   7  10  19   2]
Max-Scan(A, SB)    =  [0   5   5   3   4   4   9   2]
Min-Scan(A, SB)    =  [MX  5   1   3   3   3   3   2]
Copy-Scan(A, SB)   =  [0   5   5   3   3   3   3   2]

Figure 4: Examples of Segmented Scan Operations.
MX is the Maximum Possible Value.

$\oplus$, with identity $0$, and an ordered set $[a_0, a_1, \ldots, a_{n-1}]$, and
returns the set $[0, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$.
This operation is sometimes referred to as the data independent
prefix operation [Kruskal85]. Binary associative
operators include minimum, maximum, and plus. Figure 3
shows the results of scans using the operators maximum
and plus on some example data.

The four scan operations plus-scan, max-scan, min-scan
and copy-scan are all implemented in microcode and
take about the same amount of time as a routing cycle. The
copy-scan operation takes a value at the first processor and
distributes it over the other processors. These scan operations
can take segment bits that divide the processor
ordering into segments. The beginning of each segment is
marked by a processor whose segment bit is set, and the
scan operations will start over again at the beginning of each
segment. Figure 4 shows some examples.
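The scan primitives can be illustrated with a short sequential reference in Python. This is only a sketch of what the operations compute, not of how the microcode computes them in parallel. The function names, the use of `float("inf")` as a stand-in for MX, and the `first` operator used for copy-scan are assumptions made for this illustration. The segmented version follows the convention visible in Figure 4, where the first processor of a segment receives the result accumulated over the previous segment.

```python
import operator


def scan(op, identity, values):
    """Exclusive scan: out[k] combines values[0..k-1]; out[0] is the identity."""
    out, running = [], identity
    for v in values:
        out.append(running)
        running = op(running, v)
    return out


def segmented_scan(op, identity, values, segment_bits):
    """Segmented scan, using the convention that appears in Figure 4.

    Each position receives the running result of the elements to its left.
    The running result restarts when a set segment bit is passed, so the
    first position of a segment receives the total of the previous segment.
    """
    out, running = [], identity
    for v, starts_segment in zip(values, segment_bits):
        out.append(running)
        running = v if starts_segment else op(running, v)
    return out


A = [5, 1, 3, 4, 3, 9, 2, 6]
SB = [1, 0, 1, 0, 0, 0, 1, 0]
MX = float("inf")          # assumed stand-in for "maximum possible value"
first = lambda a, b: a     # assumed operator for copy-scan: keep the left value

plus_scan = scan(operator.add, 0, A)
max_scan = scan(max, 0, A)
seg_plus = segmented_scan(operator.add, 0, A, SB)
seg_max = segmented_scan(max, 0, A, SB)
seg_min = segmented_scan(min, MX, A, SB)
seg_copy = segmented_scan(first, 0, A, SB)
```

Because the operator only needs to be associative, the same two functions cover plus, max, min, and copy by changing `op` and `identity`.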

Versions  of  the  scan  operations  also  work  using  the 
NEWS  addressing  scheme  -  we  will  call  these  grid-scans. 
These  allow  one  to  sum,  take  the  maximum,  copy,  or  num¬ 
ber  values  along  rows  or  columns  of  the  NEWS  grid  quickly. 

For  example,  grid-scans  can  be  used  to  quickly  find  for 
each  pixel  in  an  image  the  sum  of  a  2m  x  2m  square  region 
centered  at  the  pixel.  This  sum  can  be  determined  for  all 
pixels  by  executing  a  plus-scan  along  the  rows.  Each  pixel 
then  gets  the  result  of  the  scan  from  the  processor  m/2  in 
front  of  it  and  m/2  behind  it,  and  by  subtracting  these  two 
values,  each  pixel  has  the  sum  in  its  neighborhood  along 
the row. We then execute the same calculation on the
columns, giving each element the sum of its square. The
whole calculation only requires a few scans and routing
operations and runs in time independent of the size of m.
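The region-summing idea can be sketched sequentially as follows. The half-width parameter `h` (a window of side 2h + 1 clipped at the image border) and all function names are assumptions of this sketch rather than the paper's parameterization. The point it shows is that each pixel needs only two lookups into a prefix sum per pass, regardless of the window size.

```python
def prefix_sums(values):
    """Plus-scan with one extra trailing total: s[k] = sum(values[:k])."""
    s, running = [0], 0
    for v in values:
        running += v
        s.append(running)
    return s


def window_sums_1d(values, h):
    """Sum of values[i-h .. i+h] for every i, clipped at the ends."""
    n = len(values)
    s = prefix_sums(values)
    # Two lookups and one subtraction per element, independent of h.
    return [s[min(n, i + h + 1)] - s[max(0, i - h)] for i in range(n)]


def box_sum(image, h):
    """Square-window sums: one pass along rows, then one along columns."""
    rows = [window_sums_1d(row, h) for row in image]
    cols = [window_sums_1d(list(col), h) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
sums = box_sum(image, 1)
```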

We  will  see  several  other  uses  of  the  scan  operations 
later  in  the  paper. 

1.2.3  Distance  Doubling 

Another important primitive operation is distance doubling
[Wyllie79][Lim86], which can be used to compute the effect
of any binary, associative operation, as in scan, on processors
linked in a list or a ring. For example, using max,
doubling can find the extremum of a field contained in the
processors. Using message-passing on the router, doubling
can propagate the extreme value to all processors in the
ring in $O(\log N)$ steps, where $N$ is the number of processors
in the ring. Each step involves two send operations.
Typically, the value to be maximized is chosen to be the
cube-address (a unique integer identifier) of the processor.
At termination, each processor connected in the ring knows
the label of the maximum processor in the ring, hereafter
termed the principal processor. This serves to label all connected
processors uniquely and to nominate a particular
processor (the principal) as the representative for the entire
set of connected processors. At the same time, the
distance from the principal can be computed in each processor.
Figure 5 shows the propagation of values in a ring
of eight processors. Each processor initially, at step 0, has
the address of the next processor in the ring, and a value
which is to be maximized. At the termination of the $i$th
step, a processor knows the addresses of processors $2^i + 1$
away and the maximum of all values within $2^i - 1$ processors
away. In the example, the maximum value has been
propagated to all 8 processors in $\log_2 8 = 3$ steps.


Figure  5:  Distance  Doubling,  upper  entry  is  the  address, 
the  lower  is  the  value 
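A minimal sequential simulation of distance doubling on a ring is sketched below. It assumes that all processors update in lockstep, which is modeled by building new lists from the old ones at every step. The function name `doubling_max`, the list-based representation of processor memory, and the particular ring ordering are assumptions of this sketch.

```python
def doubling_max(next_addr, value):
    """Propagate the maximum value around a ring by pointer doubling.

    next_addr[p] is the address of the processor following p in the ring,
    value[p] is the value held by p. Returns the per-processor maxima and
    the number of doubling steps performed.
    """
    n = len(next_addr)
    nxt, best = list(next_addr), list(value)
    span, steps = 1, 0
    while span < n:
        # Every processor reads from the processor its pointer names,
        # using only the state from before this step (lockstep update).
        new_best = [max(best[p], best[nxt[p]]) for p in range(n)]
        new_nxt = [nxt[nxt[p]] for p in range(n)]
        best, nxt = new_best, new_nxt
        span *= 2          # each pointer now reaches twice as far
        steps += 1
    return best, steps


# A ring of eight processors visited in an arbitrary order.
ring = [3, 7, 0, 5, 2, 6, 1, 4]
next_addr = [0] * len(ring)
for k, p in enumerate(ring):
    next_addr[p] = ring[(k + 1) % len(ring)]

# Maximize the processor address itself, so the result is the principal.
labels, steps = doubling_max(next_addr, list(range(len(ring))))
```

Since max is idempotent, wrapping around the ring more than once does no harm, so the same loop also works when the ring length is not a power of two.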


2  Vision  Tasks 

We  catalog  a  set  of  vision  tasks  common  to  many  ap¬ 
proaches  to  early  and  intermediate  vision.  Early  vision 
algorithms  involve  many  operations  having  spatial  paral¬ 
lelism,  where  all  pixels  are  operated  on  in  unison,  in  a 
spatially  homogeneous  computation.  The  classic  example 
is  filtering  or  convolution.  A  large  class  of  regularization 
computations  can  be  cast  in  this  framework,  when  the  in¬ 
put  data  lie  on  a  regular  mesh.  These  spatially  parallel 
computations  are  easily  mapped  into  operations  using  only 
NEWS accesses. When regular, but not spatially homogeneous,
operations are necessary, scan operations can be used to
produce simple, efficient algorithms. Notably, scans in grid
directions  can  also  be  used  to  make  more  efficient  some  spa¬ 
tially  homogeneous  computations  such  as  region  summing, 
described  above.  Finally,  tasks  requiring  arbitrary,  diverse 
communication  among  elements  can  utilize  the  router’s 


power.  So,  we  see  there  are  natural  classes  of  operations 
which  map  easily  into  the  different  communication  modes 
of  the  Connection  Machine.  We  will  first  describe  edge 
detection,  which  in  most  aspects  is  a  spatially  parallel  op¬ 
eration,  but  as  we  shall  see,  not  all.  In  these  algorithms,  an 
initial  mapping  of  processors  to  pixels  in  the  image  array 
is  assumed. 

3  Edge  Detection 

Edge  detection  is  an  important  first  step  in  many  low-level 
vision  algorithms.  It  generates  a  concise,  compact  descrip¬ 
tion  of  the  structure  of  the  image,  suitable  for  manipulation 
in higher-level interpretation tasks. There are many edge
detectors in the literature, but all share common aspects
of  spatial  parallelism.  Here  we  will  analyze  the  Connection 
Machine  implementation  of  Marr-Hildreth  edge  detection 
and  Canny  edge  detection  schemes. 

3.1  Filtering 

A  fundamental  operation  in  vision  processing  is  filtering 
the  input  image.  This  both  removes  noise  from  the  image 
and  selects  an  appropriate  spatial  scale.  Typically,  filtering 
is  accomplished  by  convolution  with  a  filter  of  bounded 
spatial  extent,  often  a  Gaussian.  We  have  implemented  a 
variety  of  methods  for  computing  the  Gaussian  convolution 
of  an  image. 

An  interesting,  simple  implementation  of  Gaussian  con¬ 
volution  relies  on  the  binomial  approximation  to  the  Gaus¬ 
sian  distribution.  The  algorithm  described  here  requires 
only  the  operations  of  integer  addition,  shifting,  and  local 
communication  on  the  2-D  mesh.  Since  it  does  not  re¬ 
quire  global  broadcasts,  large  memory,  or  more  than  the 
simple  network  connections,  it  could  be  implemented  on  a 
2-D  mesh  architecture  much  simpler  than  the  Connection 
Machine. 

To derive the binomial approximation, consider the
discrete signal $x[0] = 1/2$, $x[1] = 1/2$, and $x[n] = 0$
otherwise. The Z transform [Oppenheim75] of this sequence
is $(1 + z)/2$. Convolution of two discrete signals
$x_1[n]$ and $x_2[n]$ corresponds to multiplication of their
respective Z transforms $X_1(z)$ and $X_2(z)$. The result of
the convolution $x_1[n] * x_2[n] * \cdots * x_m[n]$ has Z transform
$X(z)^m = (1/2)^m (1 + z)^m$, where $x_i[n] = x[n]$ for all $i$ and $*$
denotes convolution. Thus the sequence resulting from $m - 1$
convolutions of $x[n]$ with itself is the inverse Z transform of
$(1/2)^m (1 + z)^m$. So the resulting discrete signal is just the
binomial coefficients of $(1 + z)^m$ scaled by $(1/2)^m$. From
the Central Limit Theorem, for large $m$, this sequence approaches
a discrete Gaussian.

With each convolution, the peak of the result shifts to
the right by one. By advancing the kernel $x[n]$ in space
by one, $x'[n] = x[n + 1]$, and convolving once with $x[n]$,
followed by once with $x'[n]$, the peak of the result remains
at $n = 0$. For an even number of such alternations
this is equivalent to half as many convolutions with
the kernel $y[n] = x'[n] * x[n]$, and so $y[n]$ is given by:
$y[-1] = 1/4$, $y[0] = 1/2$, $y[1] = 1/4$, $y[n] = 0$ otherwise.

This operation is very simple to implement on a 1-bit
processor such as the Connection Machine. Such a convolution
amounts to a processor adding 1/2 of its own value
to 1/4 of the sum of two of its neighbors, e.g. north and
south, or east and west. The scaling can be accomplished
simply by right shifting, or possibly reading from the neighboring
processors with a one bit offset in the field address.
Because a 2-D Gaussian convolution is separable into two
1-D convolutions, this can be used to implement an approximation
to convolution of a 2-D signal with a 2-D Gaussian
filter.

Since the variance of the sequence $y[n]$ is $1/2$, the variance
of the result of convolving $y[n]$ with itself $m$ times is
simply $(m + 1)/2$. Thus to approximate a Gaussian distribution
with standard deviation $\sigma$, $m = 2\sigma^2 - 1$ iterations
are needed.
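The binomial approximation can be checked numerically with a small one-dimensional sketch. It assumes zero values beyond the ends of the signal and reads the variance statement as follows: a kernel built from m self-convolutions of y equals m + 1 passes of y over the signal. The function names and the floating-point arithmetic are assumptions of this sketch; on the machine the scaling would be done with shifts on integers.

```python
def smooth_once(signal):
    """One pass of the kernel y = [1/4, 1/2, 1/4], zero beyond the ends."""
    padded = [0.0] + list(signal) + [0.0]
    return [0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
            for i in range(len(signal))]


def binomial_gaussian(signal, sigma):
    """Approximate Gaussian smoothing by repeated passes of y."""
    m = round(2 * sigma ** 2 - 1)     # self-convolutions of y
    out = list(signal)
    for _ in range(m + 1):            # m self-convolutions = m + 1 passes
        out = smooth_once(out)
    return out


def variance(weights):
    """Variance of the index distribution described by the weights."""
    total = sum(weights)
    mean = sum(i * w for i, w in enumerate(weights)) / total
    return sum(w * (i - mean) ** 2 for i, w in enumerate(weights)) / total


# Smoothing a unit impulse returns the effective kernel itself.
impulse = [0.0] * 41
impulse[20] = 1.0
kernel = binomial_gaussian(impulse, sigma=2.0)
kernel_variance = variance(kernel)    # expected to be close to sigma**2
```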

This approach is especially suited to the fine-grained
architecture of the Connection Machine, where, at present,
a multiplication is much more expensive than an addition.
Convolution with a Gaussian filter can also be approximated
by iterated convolution with a uniform (boxcar) filter
of width $N$ and height $1/N$. To approximate a Gaussian
with standard deviation $\sigma$,

$$\frac{12\sigma^2}{N^2 - 1} - 1$$

iterations are required. This approximation is useful when
$\sigma$ is large.

The customary implementation of the two-dimensional
Gaussian utilizes the fact that the Gaussian is separable,
and uses two one-dimensional convolutions. Because this
requires multiplications, the cost must be analyzed in terms
of them: $16\sigma$ multiplies, $8\sigma$ additions, and $8\sigma$ NEWS accesses.
The present implementation of separable convolution
should become more efficient than binomial approximation
with the simple kernel at $\sigma > 10$. Using additional
storage ($4\sigma$ words) and more clever programming could
reduce the time for separable convolution. There are many
other tradeoffs in implementing simple filtering operations:
for instance, boxcar filtering can be implemented using the
summing method based on scanning, described above, giving
dramatic improvements in speed. Further experience
will show us how to make these choices.

To implement $\nabla^2 * G(\sigma)$, one often chooses to approximate
by the Difference of Gaussians [Marr and Hildreth,
1977], $G(\sigma_1) - G(\sigma_2)$, $\sigma_1 < \sigma < \sigma_2$. Medioni and Huertas
[Huertas85] show how to implement $\nabla^2 * G$ directly as two
separable convolutions, which is more efficient than two
convolutions with Gaussians as required by the Difference
of Gaussians. We directly compute $\nabla^2 * G$ by convolution
with a Gaussian followed by convolution with a discrete
Laplacian. This is mathematically valid, but is not often
used, presumably since it requires maintaining extra precision
in the output of the Gaussian convolution. Because
the Connection Machine does not have word or byte boundaries
in its local processor memory, it is relatively easy to
implement extended precision arithmetic. Therefore it is
easy to compute $\nabla^2 * G(\sigma)$ by convolving with $G(\sigma)$, retaining
extra precision, and then filtering with the discrete
Laplacian:

$$\begin{bmatrix} 1 & 4 & 1 \\ 4 & -20 & 4 \\ 1 & 4 & 1 \end{bmatrix}$$

To  detect  zero-crossings,  each  processor  need  only  ex¬ 
amine  the  sign  bits  of  neighboring  processors,  using  NEWS 
accesses. 
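The Laplacian and zero-crossing steps can be sketched on a small integer image. The sketch takes the 3x3 kernel to be the symmetric one with corner weights 1, edge weights 4, and center weight -20. It assumes the input has already been Gaussian-smoothed, leaves border pixels at zero, and marks a pixel only on a strict sign change with its east or south neighbor. That test is a simplification of comparing sign bits, and the function names are assumptions of this sketch.

```python
LAPLACIAN = [[1, 4, 1],
             [4, -20, 4],
             [1, 4, 1]]


def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to interior pixels; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
    return out


def zero_crossings(filtered):
    """Mark pixels whose sign strictly differs from the east or south neighbor."""
    rows, cols = len(filtered), len(filtered[0])
    marks = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and filtered[r][c] * filtered[rr][cc] < 0:
                    marks[r][c] = True
    return marks


# A vertical step edge between columns 2 and 3.
step = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
edges = zero_crossings(convolve3x3(step, LAPLACIAN))
```

Integer arithmetic is used throughout, which mirrors the text's point that extra precision is easy to keep when there are no word boundaries.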

3.2  Border  Following 

Border  following  begins  to  use  the  more  flexible  communi¬ 
cation  abilities  of  the  router  of  the  Connection  Machine.  To 
analyze  this  task,  we  consider  two  parameters,  n,  the  num¬ 
ber  of  curves  in  the  image,  and  m,  the  number  of  pixels  on 
the  longest  curve.  Each  pixel  in  the  Connection  Machine 
can  link  up  with  the  neighbor  pixels  in  the  curve,  by  exam¬ 
ining  its  8-neighbors  in  the  grid,  in  negligible  time.  Higher 
vision  modules  need  a  unique  tag  or  pointer  to  refer  to  a 
curve  as  a  unit;  this  is  essentially  the  problem  of  comput¬ 
ing  connected  components.  Further,  that  label  should  be 
known  by  all  the  pixels  in  the  curve.  So,  each  pixel  on 
the  curve  must  next  be  labeled  with  a  unique  identifier  for 
the curve. Doubling permits the pixels on the curve to select
a label, the address of the principal processor, for the
curve, and to propagate that label throughout the curve in
$O(\log m)$ steps.

Then,  the  total  number  of  curves  can  be  computed  by 
selecting  the  principal  processors,  and  enumerating  them 
using  a  scan  operation.  The  enumerate  operation  returns 
the  number  of  curves  (n). 

3.3  Reconfiguring  for  Output 

At  this  point,  the  curves  have  been  linked,  labeled  uniquely, 
and  counted.  The  structure  constructed  so  far  is  sufficient 
to  support  most  operations  on  curves  for  image  understand¬ 
ing,  so  we  can  consider  all  processing  after  this  to  be  for 
output  only.  This  goes  against  our  guideline  of  keeping  all 
information  in  the  Connection  Machine,  but  is  an  interest¬ 
ing  example  of  restructuring  data  in  the  machine.  To  out¬ 
put  the  pixels  from  the  Connection  Machine,  the  points  on 
the  curves  should  be  numbered  in  order  to  create  a  stream 
of  connected  points.  The  curve-labelling  step,  using  dou¬ 
bling,  can  be  augmented  to  record  the  distance  from  the 
principal  processor,  as  well  as  its  label,  during  label  prop¬ 
agation,  at  only  a  slight  increase  in  message  length.  We 
can  find  the  length  of  the  longest  curve,  m,  by  one  global 
maximum  operation.  We  use  sorting  to  get  the  points  on 
the curves into a stream order for output. For the $i$th point
in the $j$th curve, we construct the number $mj + i$ to encode
the point's position on the curve and its membership in
the $j$th curve. This yields a unique index for each point.
Each point is ranked by this value. Points of the $j$th curve end up
ranked in order of their position on the curve, in the time
needed to sort a field of $O(\log m + \log n + 1)$ bits.

The  ordered  pixels  then  send  their  (x,y)  values  to  the 
address given by the rank, in one operation, with no collisions.
The (x,y) coordinates of the pixels on the curve will
be in sequential order in the processors by cube address.
The computation needed for border following depends on
$\log m$, where $m$ is the length of the longest curve, and $\log n$,
where $n$ is the number of curves in the image; typically, both $m$ and $n$
are proportional to the side length $N$ of an $N \times N$ image. Labelling and indexing points,
and  enumerating  curves  are  necessary  to  construct  curves 
out  of  individual  pixels.  Constructing  a  reconfigured  list  of 
the  coordinates  of  edge  points  is  only  necessary  for  output. 
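The reconfiguration step can be sketched as a sort on a combined key followed by a collision-free send. The sketch assumes that each point already carries its curve index j and its position i along the curve with 0 <= i < m, as produced by the labeling and distance computation. The tuple layout and the function name `stream_order` are assumptions of this illustration.

```python
def stream_order(points, m):
    """Arrange curve points into stream order.

    points: list of (j, i, x, y), with j the curve index and i the position
            of the point along its curve, 0 <= i < m.
    m:      length of the longest curve.
    Returns the (x, y) pairs ordered by curve, then by position on the curve.
    """
    keys = [m * j + i for (j, i, _x, _y) in points]      # unique per point
    order = sorted(range(len(points)), key=keys.__getitem__)
    rank = [0] * len(points)
    for r, p in enumerate(order):
        rank[p] = r
    output = [None] * len(points)
    for p, (_j, _i, x, y) in enumerate(points):
        output[rank[p]] = (x, y)    # each rank is distinct, so no collisions
    return output


# Two curves stored in scrambled processor order; the longest has 3 points.
scrambled = [(1, 1, 5, 6), (0, 2, 0, 2), (0, 0, 0, 0), (1, 0, 5, 5), (0, 1, 0, 1)]
stream = stream_order(scrambled, m=3)
```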

4  Canny  Edge  Detection 

The Canny edge detector is often used in image understanding
[Canny86]. It is based on directional derivatives,
so it has improved localization. Implementing the Canny
edge detector on the Connection Machine involves implementing:

1. Gaussian Smoothing $G * I$

2. Directional Derivative $\nabla(G * I)$

3.  Non-Maximum  Suppression 

4.  Thresholding  with  Hysteresis 

Gaussian filtering and computing directional derivatives
are local operations as described above. Non-maximum
suppression selects as edge candidates those pixels for
which the gradient magnitude is maximal in the direction of
the gradient. This involves interpolating the gradient magnitude
between each of two pairs of adjacent pixels among
the pixel's eight neighbors, one forward in the gradient direction,
one backward. This is accomplished in constant
time using the NEWS network.

Thresholding  with  hysteresis  is  used  to  eliminate  weak 
edges  due  to  noise.  Two  thresholds  on  gradient  magni¬ 
tudes  are  computed,  low  and  high,  based  on  an  estimate  of 
the  noise  in  the  image  intensity.  The  non-maximum  sup¬ 
pression  step  has  selected  those  pixels  for  which  the  gradi¬ 
ent  magnitude  is  maximal  in  the  direction  of  the  gradient. 
In  the  thresholding  step,  all  selected  pixels  with  gradient 
magnitude  below  low  are  eliminated.  All  pixels  with  values 
above  high  are  considered  as  edges.  All  pixels  with  values 
between  low  and  high  are  considered  edges  if  they  can  be 
connected  to  a  pixel  above  high  through  a  chain  of  pixels 
above  low.  All  others  are  eliminated.  This  is  essentially 
a  spreading  activation  operation;  it  requires  propagating 
information  along  curves. 




4.1  Histogramming 

Estimating the gradient magnitude distribution can be performed
by histogramming that value. There are several
ways this can be implemented on the Connection Machine.

Gradient magnitudes can be quantized for the histogram
bucket size, for $m$ buckets. Sorting the gradient
magnitudes configures the data thus:

$\ldots, k, k, k, k, k+1, k+1, k+1, \ldots$

Each processor determines whether it is a lower boundary
between elements with value $k$ and the elements with value
$k + 1$, for all $k$. A boundary processor sends its cube address
to location $H_k$ in the histogram table, resulting in
a cumulative frequency distribution table. To convert this
table into a histogram, each $H_i$ subtracts $H_{i-1}$, or the appropriate
lower value when $i$ was not represented in the
image. A copy-scan can fill in empty elements to simplify
this differencing operation.
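Histogramming by sorting can be sketched sequentially as below. One deliberate deviation is that a boundary element writes its position plus one, so the table holds counts rather than addresses. The forward-fill loop stands in for the copy-scan that fills empty buckets. Both choices, and the function name, are assumptions of this sketch.

```python
def histogram_by_sorting(values, m):
    """Histogram of integer values in range(m), built from sorted order."""
    ordered = sorted(values)
    n = len(ordered)
    cumulative = [None] * m
    for p, k in enumerate(ordered):
        # A boundary is the last element holding the value k.
        is_boundary = (p == n - 1) or (ordered[p + 1] != k)
        if is_boundary:
            cumulative[k] = p + 1          # number of elements <= k
    # Fill buckets that received nothing with the nearest lower entry.
    running = 0
    for k in range(m):
        if cumulative[k] is None:
            cumulative[k] = running
        running = cumulative[k]
    # Difference adjacent cumulative counts to get per-bucket counts.
    return [cumulative[0]] + [cumulative[k] - cumulative[k - 1]
                              for k in range(1, m)]


hist = histogram_by_sorting([2, 0, 2, 3, 3, 3, 0], m=5)
```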

Histogramming in $m$ buckets by counting involves stepping
$i$ through $0 \le i < m$, selecting processors where intensity
$= i$, and counting the number of selected processors ($H_i$),
using global counting operations. This needs $m$ counting
operations. When $m$ is less than 64, this can be more efficient
than histogramming by sorting.

Finally, when only one value in the distribution is needed, say, finding the $k$th percentile, estimation via probing can be used. Here, a binary search on a set of $m$ values is performed in $O(\log m)$ global counting operations, until the value $0 \le i < m$ is found for which $k$ percent of the values are $< i$. For $m = 256$, this requires only 8 counting operations. A more reliable estimate of the distribution involves computing the entire histogram, but probing may suffice for many applications.
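Probing can be illustrated with a short Python sketch. The names `count_leq` and `percentile_by_probing` are hypothetical; `count_leq` stands in for one global counting operation, and the sketch assumes the convention "at least k percent of the values are less than or equal to the threshold", which may differ slightly from the convention used on the machine.

```python
def count_leq(values, t):
    """Stand-in for one global counting operation over all processors."""
    return sum(1 for v in values if v <= t)

def percentile_by_probing(values, k, m):
    """Smallest t in range(m) with at least k percent of values <= t.

    Returns the threshold and the number of counting operations used.
    """
    need = k * len(values) / 100.0
    lo, hi = 0, m - 1
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if count_leq(values, mid) >= need:
            hi = mid          # threshold is at mid or below
        else:
            lo = mid + 1      # threshold is above mid
    return lo, probes
```

With 256 possible values the loop halves the interval exactly 8 times, matching the count quoted in the text.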

Computing  a  Hough  Transform,  which  will  be  discussed 
later,  is  a  form  of  histogramming.  There,  we  will  be  able  to 
take  advantage  of  a  priori  information  on  the  distribution 
of  values  to  devise  a  fast  algorithm.  When  the  form  of  the 
distribution  of  image  values  is  known,  similar  methods  can 
be  applied  for  general  histogramming. 

4.2  Propagation 

In  thresholding  with  hysteresis,  the  existence  of  a  high  value 
on  a  connected  curve  must  be  propagated  along  the  curve, 
to  enable  any  low  pixels  to  become  edge  pixels.  This  re¬ 
quires  a  number  of  operations  depending  on  the  length  of 
the  longest  segment  above  low  which  will  be  connected  to 
a  high  pixel.  Only  pixels  above  low,  which  survive  non¬ 
maximum  suppression,  are  considered.  Each  pixel  can,  in 
constant  time,  find  the  one  or  two  neighboring  pixels  with 
which  it  forms  a  connected  line,  by  examining  the  state  of 
all  8  neighbors  in  the  NEWS  network.  All  pixels  above  high 
are marked as edge pixels. In the current implementation, the program iterates, in each step marking as edge pixels any low pixels adjacent to edge pixels. This uses NEWS
connections.  When  no  pixel  changes  state,  the  iteration 
terminates,  taking  a  number  of  steps  proportional  to  the 
length  m  of  the  longest  chain  of  low  pixels  which  even¬ 
tually  become  edge  pixels.  Using  doubling,  propagating  a 
bit  to  indicate  the  edge  property,  changes  this  dependence 
to  O(logm).  For  most  practical  examples,  propagating  in 
the  NEWS  network  is  faster  than  using  the  asymptotically 
optimal  doubling  procedure. 
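The iterative NEWS-style propagation can be sketched serially as follows. This is a simplified illustration, not the implementation described here: the function name `hysteresis` is hypothetical, non-maximum suppression is assumed to have been applied already, and each pass of the loop stands for one parallel step in which every candidate pixel inspects its 8 neighbors.

```python
def hysteresis(mag, low, high):
    """Mark pixels above `high`, then grow along pixels above `low`.

    Returns the edge map and the number of propagation steps taken.
    """
    rows, cols = len(mag), len(mag[0])
    edge = [[v > high for v in row] for row in mag]
    steps = 0
    while True:
        grown = [row[:] for row in edge]
        for r in range(rows):
            for c in range(cols):
                if edge[r][c] or mag[r][c] <= low:
                    continue
                # a low pixel becomes an edge pixel if any 8-neighbor is one
                grown[r][c] = any(
                    edge[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, rows))
                    for cc in range(max(c - 1, 0), min(c + 2, cols))
                )
        if grown == edge:             # no pixel changed state: terminate
            return edge, steps
        edge, steps = grown, steps + 1
```

The step count returned equals the length of the longest chain of low pixels that eventually become edge pixels.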

5  Connected  Component  Labeling 

A fast practical algorithm for labeling connected components in 2-D image arrays using the Connection Machine has been developed by Lim et al. [Lim86]. The algorithm has a time complexity of $O(\log N)$ where $N$ is the number of pixels. The central idea in the algorithm is that propagating the largest or smallest number stored in a linked list of processors to all processors in the list takes $O(\log L)$ time, where $L$ is the length of the list, using doubling.
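Doubling (pointer jumping) can be illustrated with a small serial simulation. This is a sketch under stated assumptions: the helper names `_scan_max` and `propagate_max` are hypothetical, the list is laid out in index order, and the maximum is spread to every element by running the doubling once along forward links and once along backward links. It is not the algorithm of [Lim86].

```python
def _scan_max(values, link):
    """Each element gets the max over itself and everything its links reach."""
    val, ptr = list(values), list(link)
    rounds = 0
    while any(p is not None for p in ptr):
        new_val, new_ptr = val[:], ptr[:]
        for p in range(len(val)):     # all processors act in parallel
            if ptr[p] is not None:
                new_val[p] = max(val[p], val[ptr[p]])
                new_ptr[p] = ptr[ptr[p]]      # jump twice as far next round
        val, ptr = new_val, new_ptr
        rounds += 1
    return val, rounds

def propagate_max(values):
    """Give every element of a linked list the list-wide maximum."""
    n = len(values)
    fwd = [p + 1 if p + 1 < n else None for p in range(n)]
    bwd = [p - 1 if p > 0 else None for p in range(n)]
    a, r1 = _scan_max(values, fwd)
    b, r2 = _scan_max(values, bwd)
    return [max(x, y) for x, y in zip(a, b)], max(r1, r2)
```

For a list of 8 elements the simulation finishes in 3 rounds per direction, in line with the logarithmic bound.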

In the algorithm (see [Lim86] for more details), the label of a connected (4-connected) component is the largest processor address (i.e. processor id) of the processors in the set. The complexity of the algorithm is measured in terms of $N$, the length of the longest boundary in the image. The whole algorithm takes $O(\log N)$ routing cycles on the Connection Machine.

We have devised a connected component algorithm utilizing scan operations along grid-lines. In each phase of this algorithm, the label of a region, as specified by the processor with maximum cube-address, is propagated left, right, up and down, with a max-scan operation. The number of phases of this algorithm depends on the complexity and the alignment of figures in the image. Its worst-case behavior originates from an image containing long ellipsoidal regions, oriented along diagonals.
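A serial sketch of the grid-line scheme follows. The function name `label_components` is hypothetical, row-major pixel indices stand in for cube addresses, and the four sweeps of a phase are applied one after another rather than simultaneously, so phase counts will not match the parallel machine exactly.

```python
def label_components(img):
    """Label 4-connected foreground regions with their largest pixel index."""
    rows, cols = len(img), len(img[0])
    label = [[r * cols + c if img[r][c] else -1 for c in range(cols)]
             for r in range(rows)]

    def sweep(cells):
        # segmented max-scan: carry the running maximum along one grid line,
        # resetting it whenever a background pixel interrupts the run
        changed, best = False, -1
        for r, c in cells:
            if label[r][c] < 0:
                best = -1
            else:
                best = max(best, label[r][c])
                if label[r][c] != best:
                    label[r][c] = best
                    changed = True
        return changed

    changed = True
    while changed:                    # one pass of this loop is one phase
        changed = False
        for r in range(rows):
            line = [(r, c) for c in range(cols)]
            changed |= sweep(line)            # left to right
            changed |= sweep(line[::-1])      # right to left
        for c in range(cols):
            line = [(r, c) for r in range(rows)]
            changed |= sweep(line)            # top to bottom
            changed |= sweep(line[::-1])      # bottom to top
    return label
```

Regions that zigzag along a diagonal need many phases, since each phase moves a label only along straight runs.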

6  Hough  Transform 

The  Hough  transform  is  frequently  used  in  image  analy¬ 
sis  to  determine  the  existence  of  straight  lines  in  an  image 
[Ballard  and  Brown, 1981].  The  Generalized  Hough  Trans¬ 
form  similarly  is  used  to  determine  the  parameters  of  the 
position  and  orientation  of  a  known  object  from  an  im¬ 
age.  The  Hough  Transform  is  an  archetypical  algorithm 
in  that  it  demonstrates  the  accumulation  of  global  infor¬ 
mation  about  the  image  from  individual  elements  in  the 
image.  The  Hough  transform  computes  an  accumulator 
array  of  values,  where  the  (i,j)  entry  records  the  number 
of  pixels  lying  on  a  line  with  parameters  (i,j). 

Let us parametrize lines by the angle of the normal vector, $i$, and the perpendicular distance from the origin, $j$. The Hough Transform table will be stored in a matrix of processors, indexed by $(i,j)$. Computing the Hough Transform involves separate operations, one per angle, each of which computes the Hough Transform for a particular angle, $\theta(i)$. For each angle, broadcast $\cos\theta(i)$ and $\sin\theta(i)$ to each of the processors. Each processor then computes the scalar product of its $(x,y)$ address in the grid with the normal vector described by the broadcast pair:

$$j = x\cos\theta(i) + y\sin\theta(i)$$

Each processor then knows an $(i,j)$ for itself. We can count votes by sending a 1 from each active pixel, summing at the destination processor. If we were simply to send each vote to the appropriate destination in the table, there would be many collisions, messages arriving at one processor, especially when $(i,j)$ does in fact represent a line. This can be very slow when the number of collisions is large (> 64). To avoid this, we can use a clever trick, suggested by Mike Drumheller (personal communication), to reconfigure the processors.
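The voting step itself, ignoring the collision problem, can be written as a plain serial loop. In this sketch the function name `hough_accumulate`, the uniform angle spacing `pi * i / n_angles`, rounding of the distance to the nearest integer, and the index offset `max_dist` for negative distances are all illustrative assumptions.

```python
import math

def hough_accumulate(pixels, n_angles, max_dist):
    """Accumulate votes acc[i][j + max_dist] for active pixels (x, y)."""
    acc = [[0] * (2 * max_dist + 1) for _ in range(n_angles)]
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        c, s = math.cos(theta), math.sin(theta)   # the broadcast pair
        for x, y in pixels:                       # every pixel, in parallel
            j = int(round(x * c + y * s))         # scalar product with normal
            acc[i][j + max_dist] += 1             # one vote for line (i, j)
    return acc
```

Five pixels on the vertical line x = 3 all vote for the same cell at angle index 0, which is exactly the situation that produces many collisions at one destination.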

In the reconfiguration method, each pixel computes its location on a linearization of the processors in the machine by lines normal to the specified angle (see figure 6). Bold lines mark the normal lines. The pixels on the first normal line are sent to positions $0 \ldots P_0$, those on the second line go to $P_0+1 \ldots P_1$, and so on up to $N-1$. Each pixel then has a unique address, sequential along the normal lines, in the machine. Each pixel can send its value to the processor with its number, in one router cycle (there are no collisions). The pixels then lie, in linear order in the machine, according to their position on the normal lines. A boundary processor is one which occurs at the beginning of one of the normal lines; it sets a segment-bit. Then a special plus-scan operation, using segments, can accumulate the numbers for the histogram in the boundary processors. One send operation can collect the values into the histogram. This suffices to construct a column $(i,j)$, $0 \le j < max$, of the histogram. Each angle requires some computation to

1)  compute  the  scalar  product 

2)  compute  an  address  along  scan  lines 

One send, followed by a scan, followed by a send completes the process for a column. The procedure described here uses unique addresses for the linearization step. The routing hardware incurs little penalty for having up to 32 collisions per destination. A randomizing strategy is thus feasible, and easier to program, provided the number of collisions is kept small. Each pixel computes a random value at the start. For each angle, a pixel computes a location based on $j$ and the random value, using a linearization along normal lines similar to the deterministic procedure. Locations for all pixels with the same $j$ are confined to a contiguous block of processors in the Connection Machine. The sizes of the blocks can be calculated to allow enough space to reduce the average number of collisions close to zero. The votes, when they arrive, are summed, using a built-in capability of the router. Then the votes are tallied as before using a special plus-scan operation, with segments. This will achieve the same result, probably with less overhead for computing addresses.
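The segmented plus-scan used to tally one column can be sketched as follows. The names `segmented_plus_scan` and `column_of_votes` are hypothetical, sorting by `j` stands in for the linearization along normal lines, and the total is read from the last processor of each segment, whereas the text accumulates it in the boundary processor at the start of the line (which corresponds to scanning in the opposite direction).

```python
def segmented_plus_scan(values, segment_bits):
    """Running sums that restart wherever a segment bit is set."""
    out, running = [], 0
    for v, start in zip(values, segment_bits):
        if start:
            running = 0
        running += v
        out.append(running)
    return out

def column_of_votes(js):
    """Vote totals per distance j for one angle, from the j of each pixel."""
    order = sorted(js)                        # pixels laid out line by line
    bits = [p == 0 or order[p] != order[p - 1] for p in range(len(order))]
    sums = segmented_plus_scan([1] * len(order), bits)
    column = {}
    for p, j in enumerate(order):
        if p + 1 == len(order) or order[p + 1] != j:
            column[j] = sums[p]               # last element holds the total
    return column
```

For instance, `column_of_votes([2, 0, 2, 2, 1])` yields `{0: 1, 1: 1, 2: 3}`.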

If the Hough Transform were derived from oriented edge fragments, rather than pixels, only one iteration would be needed. Each processor can generate the $(i,j)$ value from its $(x,y)$ location and the edge orientation $\theta$. Each processor then sends its vote, with the :add combiner, to $(i,j)$. The maximum number of processors in a 256x256 image voting for the same $(i,j)$ would be no more than $256\sqrt{2}$, which would take no more than one step of the previous Hough Transform algorithm.


7  Geometrical  constructions 

A variety of geometrical algorithms were designed for the Connection Machine, including several convex hull algorithms and a Voronoi Diagram algorithm.

7.1  Convex  Hull 

There are four planar convex-hull algorithms that have been suggested for the Connection Machine. As with serial convex-hull algorithms, which one is the most practical will depend on the specifics of the problem. We will describe each one briefly. All of these algorithms for $n$ points require $O(n)$ processors.

The first algorithm is based on a Concurrent Read Exclusive Write (CREW) PRAM algorithm described in [Atallah85, Aggarwal85]. Since the Connection Machine does not perform well on concurrent reads, we have shown elsewhere that their algorithm can be modified to work with exclusive reads and scans [Blelloch87]. This algorithm for $n$ points has the optimal asymptotic time bound of $O(\lg n)$. Although this algorithm has the best asymptotic bound of the algorithms we mention, it has a large constant and therefore might not be the most practical algorithm.

The second is a parallel version of the QUICKHULL method [Preparata85]. This algorithm can be implemented using a small constant number of router cycles and scans per step [Blelloch87]. As with the serial QUICKHULL, for $m$ hull points, this algorithm runs in $O(\lg m)$ steps for well distributed hull points, and has a worst case running time of $O(m)$.

The third is a parallel version of Graham’s sequential algorithm [Preparata85] and requires $O(\log^2 n)$ routing cycles for $n$ points [Little86a]. The final algorithm is a parallel version of the Jarvis march algorithm [Preparata85], requiring only a few arithmetic operations, plus a global maximum computation per hull point [Little86a]. Although this algorithm requires $O(m)$ steps for $m$ hull points, it might be the most practical for some applications since the constant is small and the number of points on the hull is often small.
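The Jarvis march can be sketched serially in Python. This is the textbook gift-wrapping procedure, not the Connection Machine code of [Little86a]; the names `cross`, `dist2` and `jarvis_march` are hypothetical. The inner loop over all points plays the role of the global maximum computation done once per hull point.

```python
def cross(o, a, b):
    """Positive when b lies to the left of the ray from o through a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def jarvis_march(points):
    """Convex hull in counter-clockwise order, one hull point per step."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    start = current = pts[0]          # the leftmost point is on the hull
    hull = []
    while True:
        hull.append(current)
        candidate = pts[0] if pts[0] != current else pts[1]
        for p in pts:                 # a global reduction on the machine
            if p == current:
                continue
            turn = cross(current, candidate, p)
            farther = dist2(current, p) > dist2(current, candidate)
            if turn < 0 or (turn == 0 and farther):
                candidate = p         # p is further clockwise: prefer it
        current = candidate
        if current == start:
            return hull
```

On the four corners of a square plus its center, the march visits only the four corners, taking one step per hull point.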

7.2  Voronoi  Diagrams 

Aggarwal et al. [Aggarwal85] describe an $O(\log^3 N)$ algorithm for computing the Voronoi diagram of $N$ points in parallel using the CREW (Concurrent Read Exclusive Write) model. For $N = 1000$, this is 1000 steps, each of which needs at least one routing cycle. Since the Connection Machine has the NEWS network, a set of mesh connections among the processors, a brush-fire method can be easily implemented. One can argue that in many vision applications the coordinates of the points are restricted to the range of the resolution of the camera coordinate system, in which case working at image resolution is acceptable.

In the proposed algorithm, each pixel in the mesh is labeled with the $(i,j)$ location of the point closest to its $(x,y)$ coordinate, which we may take to be its center. The result of this labeling is accurate only at the grid-spacing; each grid intersection is labeled with its closest point. For the $L_1$ and $L_\infty$ metrics, propagation is simple. By using 4-connected or 8-connected neighbors, we can make the arrival time of a propagated label be identical to its $L_1$ or $L_\infty$ distance. The Euclidean-distance metric is more problematic, and we must propagate the $(x,y)$ of a point as well as its index. The Voronoi region around a point can be labelled in $D$ steps, where $D$ is the diameter of the largest Voronoi region.
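For the $L_1$ metric the brush-fire propagation reduces to a synchronous flood over 4-connected neighbors. The sketch below is illustrative only: the function name `brushfire_voronoi` is hypothetical, and ties between equidistant points are broken in favor of the smaller point index, a choice the text does not specify.

```python
def brushfire_voronoi(rows, cols, sites):
    """Label each grid cell with the index of its L1-nearest site.

    sites is a list of (row, col) positions; returns (labels, steps).
    """
    label = [[None] * cols for _ in range(rows)]
    for idx, (r, c) in enumerate(sites):
        label[r][c] = idx
    steps = 0
    while True:
        updates = {}
        for r in range(rows):
            for c in range(cols):
                if label[r][c] is not None:
                    continue
                for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols \
                            and label[rr][cc] is not None:
                        cand = label[rr][cc]
                        # tie-break: keep the smaller site index (assumption)
                        if (r, c) not in updates or cand < updates[(r, c)]:
                            updates[(r, c)] = cand
        if not updates:
            return label, steps
        for (r, c), idx in updates.items():   # all cells update together
            label[r][c] = idx
        steps += 1
```

The number of steps equals the largest L1 distance from any cell to its nearest site.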

The Delaunay triangulation is the dual of the graph of the Voronoi diagram. We say a pixel is an edge pixel if any neighbors do not have the same label. To generate the adjacency information for the vertices in the Delaunay triangulation, label each edge pixel by the indices of the points whose regions it bounds; for the edge connecting $i$ to $j$, with $i < j$, generate the number $iN + j$. This generates a number of $2\log N$ bits, for $N$ points. This number is a compressed representation for the edge. Sort these numbers, and then eliminate all those whose lower neighbor (by hypercube address) is the same. This eliminates multiple entries for each edge. Then, each edge in the Voronoi Diagram has a unique processor identified with it; this can rapidly be transformed into a useful graph representation on the Connection Machine, in which the edges adjacent to a vertex are stored in a contiguous block of processors, and edges $(i,j)$ contain the address of the processor dedicated to $(j,i)$. Each edge is thus stored twice.
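The encode, sort and eliminate-duplicates steps can be sketched as follows. The function name `delaunay_edges` is hypothetical, and the sketch assumes that only right and down neighbors need to be inspected to find every pair of adjacent regions in a 4-connected grid.

```python
def delaunay_edges(label, n_sites):
    """Adjacent region pairs (i, j), i < j, from a grid of region labels."""
    rows, cols = len(label), len(label[0])
    codes = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):       # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and label[rr][cc] != label[r][c]:
                    i, j = sorted((label[r][c], label[rr][cc]))
                    codes.append(i * n_sites + j)  # compressed edge code
    codes.sort()                                   # parallel sort on the CM
    # keep a code only if its lower neighbor in sorted order differs
    unique = [code for p, code in enumerate(codes)
              if p == 0 or codes[p - 1] != code]
    return [divmod(code, n_sites) for code in unique]
```

`divmod(code, n_sites)` recovers the pair because `j < n_sites`.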

Propagation is the most costly operation in this algorithm, and depends on the diameter of the largest Voronoi region. Trial examples with 1000 randomly distributed points had average maximum diameter approximately 12. The additional computation to label edges and generate the graph representation will take less than one propagation cycle. The result of this implementation is only roughly correct; certain thin Voronoi regions can cause errors. However, these may not be of any practical consequence. Charles Leiserson and Cynthia Phillips (personal communication) have developed a simple extension to this algorithm which corrects these faults.

8  Triangle  Visibility 

Certain systems in vision need to determine visibility of scene elements. It is interesting to consider the performance of the Connection Machine on such a problem, even though the problem is essentially in the domain of computer graphics. For example, let the input be a set of $m$ opaque triangles in three-dimensional space, selected at random with each coordinate in some range. Find the vertices of the triangles that are visible from the origin.

The  problem  can  be  parallelized  over  the  set  of  triangles 
or  the  set  of  vertices.  Even  though  there  are  three  times  as 
many  vertices,  the  description  of  a  triangle  is  larger,  and 
the  increase  in  communication  costs  overcomes  the  reduced 
number  of  elements.  A  triangle  shadows  all  vertices  which 
lie  in  the  triangular  cone  formed  by  the  origin  and  the  edges 
of  the  triangle,  and  which  are  behind  the  plane  containing 
the  triangle.  The  volume  in  space  defined  by  this  criterion 
is  described  by  4  linear  inequalities,  from  the  bounding 
half-spaces. Each triangle, in a preprocessing step, generates the four plane equations. A vertex can then be tested for visibility by evaluating these equations for its $(x,y,z)$ coordinates.
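The four half-space tests can be sketched in Python. The helper names `shadow_planes` and `is_shadowed` are hypothetical, and the sketch assumes a non-degenerate triangle whose plane does not pass through the origin. Three planes contain the origin and one edge each; the fourth is the plane of the triangle, oriented so that the origin is on its negative side.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _neg(a):
    return (-a[0], -a[1], -a[2])

def shadow_planes(a, b, c):
    """Four (normal, offset) pairs; p is shadowed if all evaluate positive."""
    planes = []
    for u, v, w in ((a, b, c), (b, c, a), (c, a, b)):
        n = _cross(u, v)              # plane through the origin and edge uv
        if _dot(n, w) < 0:            # orient toward the third vertex
            n = _neg(n)
        planes.append((n, 0.0))
    n = _cross(_sub(b, a), _sub(c, a))   # plane containing the triangle
    d = -_dot(n, a)
    if d > 0:                         # make the origin side negative
        n, d = _neg(n), -d
    planes.append((n, d))
    return planes

def is_shadowed(p, planes):
    return all(_dot(n, p) + d > 0 for n, d in planes)
```

A point directly behind the triangle is reported as shadowed, while points in front of it or outside the cone are not.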

The following formulation uses multiple copies of the triangles. The problem can be parallelized by copying the triangles $k$ times in the memory (64K) of the Connection Machine, where

$$k = 65536/m$$

This divides the machine into $k$ subsets of processors. Each triangle processor will handle up to $n = \lceil 3m^2/65536 \rceil$ points. Triangles 0 through k occupy processors 0 through k (cube address), and so forth. The descriptions of the triangles must be generated; this needs several vector subtractions and cross-products to compute normal equations for planes. The computed triangle descriptions comprise 4 plane equations,

$$A_i x + B_i y + C_i z + D_i = 0$$

each of which contains 4 32-bit numbers; the entire description is 512 bits long. The descriptions of all $m$ triangles can be copy-scanned to replicate them $k$ times, and then sent, in one step, to the correct processors. Then, points are sent to the sets of triangles against which they are to be tested. The first $n$ points are sent to processors $0 \ldots n-1$, the next $n$ to processors $m \ldots m+n-1$, and so forth.

Segment bits are inserted at the termination of each set of triangles. In each testing step, the description of the point at the beginning of each set of points is copy-scanned across the set of triangles. All triangles test the active points in parallel. Then, the descriptions of the points are sent left. This brings a new point to the beginning of each section of triangles, ready to be copied to all the triangles in the next step. Since there are $n = \lceil 3m^2/65536 \rceil$ steps, the total time required depends on $n$.

Many of the triangles in this example are themselves completely occluded by larger triangles. We suggest a first filtering step so that small triangles which are occluded by large triangles can be eliminated before the problem is partitioned into groups. Each triangle is projected onto a projection plane, anywhere in the visible region, orthogonal to a line of sight from the origin. Then the projection plane is subdivided into a $k \times k$ grid. Each triangle can determine all squares it covers completely. At the intersection points in this grid, each triangle determines its z-value. Each triangle should cover, on average, one-sixth of the rectangles. Now, each triangle generates a copy of itself for each rectangle it covers; choose $k$ so that

$$\frac{k^2}{6}\, m < 65535$$

where $m$ is the number of triangles, so that each triangle-rectangle pair can be allocated to a processor. Using $m = 1000$, $k$ is 18. The triangle-rectangle pairs contain the zmin, zmax of each triangle in that rectangle. Organize the processors contiguously by rectangle. Then scan the triangles to find the minimum of the zmax covering the rectangle. Eliminate all triangles in a rectangle whose zmin is greater than this zmax; these triangles cannot be seen. This should significantly reduce both the number of triangles, and, consequently, the number of vertices. Then, using the process described above, the number of iterations should be significantly smaller.

9  Graph  matching 

The problem is to compute the list of the occurrences of (an isomorphic image of) a small graph $H$ as a subgraph of a larger graph $G$. We will outline a method to distribute the matching process among the processors of the CM. A similar solution for object recognition is described in [Harris86]. We utilize dynamic allocation of processors to matchings. A partial matching is contained in a processor. At each step in the graph matching algorithm a matching (processor) acquires the information necessary to determine all legal successors. It then finds processors to continue with the new matchings, and is then returned to the pool of free processors. The information concerning the graphs can be stored in several ways in the memory of the Connection Machine. We can store the adjacency list for a vertex in $G$ as a bit vector in a processor. The graph $G$ can be stored, with many copies, throughout the CM. Each matching processor can access these copies randomly, so that contention among the processors is minimized. The description of the smaller graph $H$ can be stored in each matching processor, as well as the partial matching it is expanding.

Initially no processors are allocated. We use rendezvous allocation [Hillis85] to assign processors to matchings. The order in which vertices in $H$ are matched to $G$ can be precomputed to maximize the number of vertices in $H$ adjacent to the next vertex to be expanded. In each phase of matching generation, a matching at level $k$, $1 \le k < |H|$, must expand itself to all legal successor matchings at the next level. Matching processors may be expanding at many different levels, since resource limitations may delay expansion until some processor fails, and is returned to the pool. To expand itself, a matching must know, first, the neighbors of $H_{k+1}$, and, second, the vertices in $G$ to which those neighbors have been matched. These data allow the matching at level $k$ to prune its expansion, generating only legal successors. The description of $H$ is stored locally in each processor. To recover the neighbors of $H_{k+1}$, each processor steps through the description of $H$ until it encounters the $(k+1)$th entry, and then records the contents of this entry. This step finds the neighbors of the new vertex in $H$.

Each expanding matching examines the neighbors of $H_{k+1}$ to determine the nodes in $G$ to which they have been matched. The neighbors of each such vertex in $G$ must be retrieved from the distributed representations of $G$, using a send operation. Now, we must compute the intersection of these bit vectors, describing all possible nodes in $G$ adjacent to the matches in $G$ of neighbors of $H_{k+1}$. This can be done in time linear in the number of nodes in $G$, but such bit operations are fast. Then we exclude from the intersection all nodes already matched in the current matching, leaving the possible expansions in $G$. All are legal, that is, the nodes in $G$ to be matched are unmatched, and are adjacent to existing constraining matches from $H$. If this set is empty, the matching fails.
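One expansion step can be sketched with Python integers as bit vectors. The function name `expand` and the argument layout are hypothetical: `matching[v]` is the vertex of G assigned to the v-th vertex of H, `h_neighbors` lists the already matched vertices of H adjacent to the next vertex, and bit `u` of `g_adj[w]` is set when `u` and `w` are adjacent in G.

```python
def expand(matching, h_neighbors, g_adj, n_g):
    """Return all legal successor matchings of a partial matching."""
    candidates = (1 << n_g) - 1           # start with every vertex of G
    for v in h_neighbors:
        # intersect with the adjacency vector of the match of each neighbor
        candidates &= g_adj[matching[v]]
    for w in matching:
        candidates &= ~(1 << w)           # exclude vertices already matched
    # an empty candidate set means the matching fails (no successors)
    return [matching + [u] for u in range(n_g) if (candidates >> u) & 1]
```

In a graph G made of a triangle 0, 1, 2 with a pendant vertex 3 attached to 2, the partial matching `[0, 1]` of a triangle H has the single successor `[0, 1, 2]`, while `[2, 3]` has none.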

The entire computation to expand a matching has several parts, each taking, at most, time linear in the size of $G$. Then, each matching requests processors for its successors from the free pool, in a processor allocation step. The description of a partial matching at the $k$th level can be distributed to the successor nodes during allocation. The matching processor then returns itself to the free pool of processors. A priority mechanism can be implemented to favor matchings which are nearer completion. The overhead of allocation and distribution should be no more costly than the entire computation to determine legal successors.

The total throughput of this algorithm can be measured in terms of the number of partial matchings generated in each step. The critical factor in this problem is to control the number of active matchings so as to allow expansion, without creating blocking. The process should monitor itself to record the average number of successors at each level, allowing good control of allocation. This discussion by no means solves this difficult problem; there are many thorny issues of control and task allocation yet to be analyzed completely.

10  Minimum-cost  path 

The  problem  is,  given  two  vertices  P,Q  of  G,  an  undirected 
graph  with  nonnegative  real-valued  weights  on  the  edges, 
to  find  a  path  from  P  to  Q  along  which  the  sum  of  the 
weights  is  minimum. 

The graph can be represented as an adjacency list in the CM. The algorithm, a CM implementation of Dijkstra’s algorithm, is given in [Hillis85]. Each step in computing the shortest path consists in each vertex sending to each of its neighbors the distance from the source to itself plus the length of the connecting edge along which the message is sent. Each step involves a send operation, using the router. The receiver compares all incoming values and selects the minimum.

Consider  sending  messages  only  when  the  distance  from 
the  source  is  less  than  infinity  (some  initial  value  for  all 
processors).  This  reduces  the  number  of  conflicts  at  many 
stages.  Initial  experiments  require  only  a  few  router  op¬ 
erations  per  step.  The  number  of  steps  depends  on  the 
diameter  (the  length  of  the  longest  path  in  the  graph  ex¬ 
plored).  The  algorithm  stops  when  no  processor  changes 
its  value  as  the  result  of  the  messages  it  has  received. 
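The message-passing step and the stopping rule can be simulated serially. This sketch follows the description in the text (every vertex with a finite distance sends to all its neighbors, receivers keep the minimum) and is not the code of [Hillis85]; the names `min_cost_paths` and `path_to` and the dictionary-based adjacency list are illustrative assumptions.

```python
import math

def min_cost_paths(adj, source):
    """adj maps a vertex to a list of (neighbor, weight) pairs."""
    dist = {v: math.inf for v in adj}
    pred = {source: None}
    dist[source] = 0.0
    while True:
        inbox = {}
        for u, edges in adj.items():
            if dist[u] == math.inf:       # only reached vertices send
                continue
            for v, w in edges:
                msg = (dist[u] + w, u)
                if v not in inbox or msg < inbox[v]:
                    inbox[v] = msg        # receiver keeps the minimum
        changed = False
        for v, (d, u) in inbox.items():
            if d < dist[v]:
                dist[v], pred[v] = d, u
                changed = True
        if not changed:                   # no processor changed its value
            return dist, pred

def path_to(pred, target):
    """Follow predecessor links back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]
```

For a triangle with edges P-A of weight 1, A-Q of weight 2 and P-Q of weight 5, the minimum-cost path from P to Q goes through A with total weight 3.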

11  Summary 

We have identified several powerful communication patterns which allow us to utilize the Connection Machine to design efficient algorithms for vision. The powerful communication methods permit construction of extended image structures in logarithmic time. Schemes can easily arrange data in the Connection Machine so that rapid distribution of information using the scan mechanism allows efficient algorithms. These algorithms represent a subset of those currently designed for or implemented on the Connection Machine. Other general algorithms for vision are described in [Harris86] and [Little86a].

We have described the algorithms in terms of fundamental operations of the Connection Machine. We will give a quick outline of the comparative running times of the primitives we have been discussing. Let us take as a time unit the time for a 1-bit operation. The actual time required by this operation will change as the Connection Machine evolves. In these units, then, accessing the memory of a neighboring processor via the NEWS network takes 3 units per bit. Global operations, such as logical or and logical and, need less than 50 units. Counting all active processors requires 200 units. Routing and scanning use 500 units. Adding 16-bit quantities takes 20 units, while multiplying them takes 250 units. The relative speed of these operations may change in future versions of the Connection Machine. In another document [Little86a], we give more detailed descriptions of these algorithms and their running times, on the present machine. In actual time, for example, our present implementation of Canny edge detection, operating on a 256x256 image, from image to linked edge pixels, takes less than 50ms. Many of the other algorithms described above achieve similar times, and show speedup of several orders of magnitude over present implementations on serial computers.

12  Acknowledgments 

Support  for  the  A.I.  Laboratory’s  artificial  intelligence 
research  is  provided  in  part  by  the  Advanced  Research 
Projects  Agency  of  the  Department  of  Defense  under  Office 
of  Naval  Research  contract  N00014-85-K-0214. 

References 

[Aggarwal85] A. Aggarwal, B. Chazelle, L. Guibas, C. O’Dunlaing and C. Yap, “Parallel Computational Geometry”, Proc. 26th IEEE Symp. Found. of Comp. Sci., 1985, 468-477.

[Blelloch86]  G.  Blelloch,  “Parallel  Prefix  vs.  Concur¬ 
rent  Memory  Access”,  Technical  report  (in  prepara¬ 
tion).  Thinking  Machines  Corporation,  Cambridge, 
Massachusetts,  1986. 

[Blelloch87] G. Blelloch and J. Little, “Convex Hull Stuff”, submitted to Third Annual Symposium on Computational Geometry, 1987.

[Canny86]  J.F.  Canny,  “A  Computational  Approach  to 
Edge  Detection”,  IEEE  Transactions  on  Pattern  Analy¬ 
sis  and  Machine  Intelligence,  Nov.  1986,  Vol.  8,  No.  6, 
679-698. 




[Cook80] S.A. Cook, “Towards a Complexity Theory of Synchronous Parallel Computation”, Int. Symp. über Logik und Algorithmik zu Ehren von Professor Ernst Specker, Zurich, 1980.

[Drumheller86]  M.  Drumheller  and  T.  Poggio,  “Parallel 
Stereo”,  Proc.  of  IEEE  Conference  on  Robotics  and 
Automation,  San  Francisco,  1986. 

[Harris86] J.G. Harris and A.M. Flynn, “Object Recognition Using the Connection Machine’s Router”, Proc. IEEE 1986 Conf. Computer Vision and Pattern Recognition, 1986, 134-139.

[Hillis85]  D.  Hillis,  The  Connection  Machine,  MIT  Press, 
Cambridge,  1985. 

[Lasser86] C. Lasser, “The Complete *Lisp Manual”, Thinking Machines Corporation, Cambridge, Massachusetts, 1986.

[Little86a] J.J. Little, “Parallel Algorithms for Computer Vision”, AIM-928, MIT AI Laboratory, November 1986.

[Little86b] J.J. Little and H. Bulthoff, “Parallel Computation of Optical Flow”, AIM-929, MIT AI Laboratory, November 1986.

[Lim86]  W.  Lim,  A.  Agrawal  and  L.  Nekludova,  “A  Fast 
Parallel  Algorithm  for  Labeling  Connected  Components 
in  Image  Arrays”,  submitted  to  the  International  Con¬ 
ference  on  Computer  Vision,  1987. 

[Huertas85] A. Huertas and G. Medioni, “Detection of Intensity Changes with Subpixel Accuracy using Laplacian-of-Gaussian Masks”, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.

[Oppenheim75]  A.V.  Oppenheim  and  R.W.  Schafer,  Digital 
Signal  Processing,  Prentice-Hall,  Englewood  Cliffs,  New 
Jersey,  1975. 

[Overmars81] M.H. Overmars and J. Van Leeuwen, “Maintenance of Configurations in the Plane”, Journal of Computer and System Sciences, 23, 166-204.

[Preparata85]  F.P.  Preparata  and  M.I.  Shamos,  Computa¬ 
tional  Geometry  -  An  Introduction,  Springer,  New  York, 
1985. 

[Wyllie79]  J.C.  Wyllie,  The  Complexity  of  Parallel  Compu¬ 
tations,  TR  79-387,  Department  of  Computer  Science, 
Cornell  University,  Ithaca,  New  York,  August  1979. 




SOLVING  THE  DEPTH  INTERPOLATION  PROBLEM 
ON  A  FINE  GRAINED,  MESH-  AND  TREE-CONNECTED  SIMD  MACHINE 


Dong  J.  Choi 


Department  of  Computer  Science 
Columbia  University 
New  York,  N.Y.  10027 


Abstract 

This  paper  discusses  solving  the  depth  interpolation 
problem  on  a  particular  parallel  architecture  (a  fine  grained, 
mesh-  and  tree-connected,  SIMD  machine).  Many  low  level 
computer  vision  problems,  including  depth  interpolation,  can  be 
cast  as  solving  a  symmetric  positive  definite  (SPD)  matrix. 
Usually,  the  resulting  SPD  matrix  is  sparse.  In  a  previous  result, 
we  showed  how  the  adaptive  Chebyshev  acceleration  method  for 
sparse  SPD  matrices  could  be  run  on  this  particular  architecture. 
Here, we show how the conjugate gradient method can be run under the same framework. Lastly, we show simulation results from the two methods for reasonably large synthetic images, and compare them with the results from one of the commonly used
basic  iterative  methods,  the  Gauss-Seidel  method.  We  also 
detail  our  future  plans.1 


1.  Depth  Interpolation 

Human perception is a vivid one of dense and coherent
surfaces in depth. This suggests that there exists a
visible-surface reconstruction process that transforms sparse
information into a dense surface representation. With the sparse
depth and/or orientation constraints generated from low level
visual processes, a depth interpolation process would compute the
depth of the visible surfaces at every point explicitly.

Grimson formulated one approach to the depth
interpolation problem [Grim 81]. He suggested an
"interpolation" method. Given a set of scattered depth
constraints, the surface which best fits the known constraints is
that which passes through the known points exactly and
minimizes the expression referred to as the quadratic variation of
the surface. He used a gradient descent method to find such a
surface, and observed slow convergence rates.

Terzopoulos worked further on surface representation
[Terz 84]. Instead of Grimson's "interpolation" approach, he
proposed an "approximation" method where the discrete
potential energy functional associated with the surface is
minimized. In his formulation, known depth and/or orientation
constraints contribute as spring potential energy terms.

The discrete form of the visible-surface reconstruction
problem is described as the solution of a large sparse linear
system of equations. The nonzero coefficients of each equation are
specified as summations of computational molecules. Given the
depth constraints and the orientation constraints, a set of
computational molecules computes the nonzero coefficients of
the linear system by local computations. Because of the
symmetric nature of the computational molecules, it can be
easily shown that the resulting matrix is symmetric. Furthermore,
Terzopoulos shows the stronger result that the matrix generated
is symmetric and positive definite (SPD). The matrix is also
sparse. Even a node that is sufficiently distant from a
boundary where the depth is discontinuous interacts with
only 12 neighbors, all of them at most 2 nodes away.
Terzopoulos used a multi-grid approach with the Gauss-Seidel
relaxation method at each relaxation sweep to speed up the
convergence rate.

¹ This research was sponsored by the Defense Advanced Research Projects
Agency under contract N00039-84-C-0165.

We follow Terzopoulos' formulation of visible surface
reconstruction and use the computational molecules he proposed.
However, we present an alternative depth interpolation
process using theoretically better iteration methods, which
speed convergence and are amenable to certain classes of parallel
computers. In our previous paper [Choi 85], we used the
adaptive Chebyshev acceleration method. In this paper, we
present another efficient method, the conjugate gradient method.
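
As a small illustration of the sparsity described in this section, the following Python sketch writes down a 13-point interior molecule and checks the properties stated in the text: 12 neighbors, none more than 2 nodes away, and symmetry. The weights are the textbook values of the discrete thin-plate (biharmonic) stencil and are an assumption made for illustration; the exact scaling and the boundary and discontinuity molecules of [Terz 84] are not reproduced here.

```python
# Hypothetical interior molecule, keyed by (row offset, column offset).
# Weights are the standard discrete biharmonic stencil (an assumption).
MOLECULE = {
    (0, 0): 20,
    (-1, 0): -8, (1, 0): -8, (0, -1): -8, (0, 1): -8,
    (-1, -1): 2, (-1, 1): 2, (1, -1): 2, (1, 1): 2,
    (-2, 0): 1, (2, 0): 1, (0, -2): 1, (0, 2): 1,
}


def molecule_neighbours(molecule):
    """Offsets of all nodes that interact with the centre node."""
    return [offset for offset in molecule if offset != (0, 0)]


def molecule_reach(molecule):
    """Largest offset, in nodes, along either mesh axis."""
    return max(max(abs(di), abs(dj)) for di, dj in molecule)


def is_symmetric(molecule):
    """True if the weight at each offset equals the weight at its mirror image."""
    return all(molecule.get((-di, -dj)) == w for (di, dj), w in molecule.items())


if __name__ == "__main__":
    print(len(molecule_neighbours(MOLECULE)), "neighbours, reach", molecule_reach(MOLECULE))
```

Because each coefficient depends only on a fixed neighborhood of this kind, one row of the linear system can be held by one PE and evaluated with mesh communication alone.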


2.  Conjugate  Gradient  Method 

The depth interpolation problem has been cast as solving a
large system of linear equations²

\[ Ax = b \tag{1} \]

where $A = A^* > 0$ is an $n \times n$ Hermitian³ and positive definite
matrix [Wozn 80].

We solve equation (1) iteratively by constructing a
sequence $\{x^{(i)}\}$ converging to the solution $\alpha = A^{-1}b$. Let
$B = B^* > 0$ be a matrix which commutes with $A$: $BA = AB$.

Let $\|x\|_B = \|B^{1/2}x\| = (Bx, x)^{1/2}$, where $\|x\| = (x, x)^{1/2}$.

Let $x^{(0)}$ be an initial approximation and consider a class of
iteration methods for which the error formula satisfies the
relation

\[ x^{(i)} - \alpha = W_i(A)\,(x^{(0)} - \alpha), \]

where $W_i$ is a polynomial of degree at most $i$ and $W_i(0) = 1$.
We seek the polynomials $W_i$ such that the error
$e_i = \|x^{(i)} - \alpha\|_B$ is minimized. The solution is given by the
orthogonal polynomials defined as follows. Let

\[ x^{(0)} - \alpha = \sum_{j=1}^{m} c_j\,\xi^{(j)}, \]


² In this paper, we use x to denote the depth vector. In a previous paper, we
used u.

³ Since the matrix A is both real and symmetric, it is Hermitian.




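where $\xi^{(j)}$ is an eigenvector of $A$ associated with the
eigenvalue $\lambda_j$: $A\xi^{(j)} = \lambda_j\xi^{(j)}$, $\|\xi^{(j)}\| = 1$,
$0 < \lambda_1 < \lambda_2 < \cdots < \lambda_m$,
with $m \le n$ and $c_j \ne 0$ for $j = 1, 2, \ldots, m$. From the
orthogonality of the polynomials $W_i$ it follows that they satisfy a
three-term recurrence formula. This form is defined as follows:

\[ W_0(\lambda) = 1, \]

\[ W_1(\lambda) = 1 - c_0\lambda, \]

\[ W_{i+1}(\lambda) = W_i(\lambda) - c_i\lambda W_i(\lambda)
   - u_i\,\{W_{i-1}(\lambda) - W_i(\lambda) + c_i\lambda W_i(\lambda)\}, \quad i \ge 1, \]

where

\[ c_i = \frac{(W_i,\, W_i)}{(\lambda W_i,\, W_i)}, \]

\[ u_0 = 0, \]

\[ u_i = \frac{(W_i - c_i\lambda W_i,\; \lambda^{-1}(W_{i-1} - W_i) + c_i W_i)}
              {(W_{i-1} - W_i + c_i\lambda W_i,\; \lambda^{-1}(W_{i-1} - W_i) + c_i W_i)}. \]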

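From this we get the three-term recurrence formula for the
sequence $\{x^{(i)}\}$,

\[ r^{(i)} = Ax^{(i)} - b, \]

\[ z^{(i)} = x^{(i)} - c_i r^{(i)}, \]

\[ y^{(i)} = x^{(i-1)} - z^{(i)}, \]

\[ x^{(i+1)} = z^{(i)} - u_i y^{(i)}, \tag{2} \]

where

\[ c_i = \frac{(r^{(i)},\, B(x^{(i)} - \alpha))}{(r^{(i)},\, Br^{(i)})}, \]

\[ u_0 = 0, \]

\[ u_i = \frac{(y^{(i)},\, B(z^{(i)} - \alpha))}{(y^{(i)},\, By^{(i)})}. \tag{3} \]

In exact arithmetic, the conjugate gradient method (2), (3)
converges in $m$ steps.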

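For $B = A$ we minimize $\|A^{1/2}(x^{(i)} - \alpha)\|$. This corresponds
to the classical conjugate gradient method. After further
simplification, we have

\[ c_i = \frac{(r^{(i)},\, r^{(i)})}{(r^{(i)},\, Ar^{(i)})}, \]

\[ u_0 = 0, \]

\[ u_i = \frac{(y^{(i)},\, Az^{(i)} - b)}{(y^{(i)},\, Ay^{(i)})}, \quad i \ge 1. \tag{4} \]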

On examination of equations (2) and (4), we find that
we seem to need four matrix multiplications. But two matrix
multiplications, $Ay^{(i)}$ and $Az^{(i)}$, can be eliminated with
substitutions. Now we have

\[ c_i = \frac{(r^{(i)},\, r^{(i)})}{(r^{(i)},\, Ar^{(i)})}, \]

\[ u_0 = 0, \]

\[ u_i = \frac{(y^{(i)},\; r^{(i)} - c_i Ar^{(i)})}
              {(y^{(i)},\; Ax^{(i-1)} - Ax^{(i)} + c_i Ar^{(i)})}, \quad i \ge 1. \]


We  show  how  this  method  is  easy  to  implement  in  the 
following  architecture,  and  investigate  its  efficiency. 
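
The recurrence above, with the two remaining matrix products $Ax^{(i)}$ and $Ar^{(i)}$, can be sketched serially in Python as follows. This is a minimal illustration and not the NON-VON implementation: the function names, the dense list-of-lists matrix, the stopping tolerance and the guard against a zero denominator are all assumptions added for the example.

```python
def matvec(A, v):
    """Dense matrix-vector product (stands in for the local molecule sums)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product (a global summary over the tree on the parallel machine)."""
    return sum(a * b for a, b in zip(u, v))


def cg_three_term(A, b, x0, tol=1e-12, max_iter=100):
    """Three-term conjugate gradient iteration for SPD A, residual r = A x - b."""
    x_prev, Ax_prev = None, None
    x = list(x0)
    for _ in range(max_iter):
        Ax = matvec(A, x)                                # first matrix product
        r = [p - q for p, q in zip(Ax, b)]
        rr = dot(r, r)
        if rr <= tol * tol:                              # assumed stopping rule
            break
        Ar = matvec(A, r)                                # second matrix product
        c = rr / dot(r, Ar)
        z = [xi - c * ri for xi, ri in zip(x, r)]
        if x_prev is None:
            x_next = z                                   # u_0 = 0
        else:
            y = [xp - zi for xp, zi in zip(x_prev, z)]
            Az_minus_b = [ri - c * ari for ri, ari in zip(r, Ar)]
            Ay = [p - q + c * s for p, q, s in zip(Ax_prev, Ax, Ar)]
            denom = dot(y, Ay)
            u = dot(y, Az_minus_b) / denom if denom > 0.0 else 0.0
            x_next = [zi - u * yi for zi, yi in zip(z, y)]
        x_prev, Ax_prev = x, Ax
        x = x_next
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [6.0, 10.0, 8.0]                                 # A times (1, 2, 3)
    print(cg_three_term(A, b, [0.0, 0.0, 0.0]))
```

Keeping $Ax^{(i-1)}$ from the previous step is what removes the products $Ay^{(i)}$ and $Az^{(i)}$, at the cost of one extra stored vector.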


3.  Architectural  Background 

As we remarked in our previous paper, a parallel
architecture that matches the particular structure of our
application needs the following characteristics:

•  fine  grained 

•  SIMD 

•  local  communication 

•  global  communication. 


In  this  section,  we  derive  a  model  of  SIMD  computation 
that  has  all  necessary  characteristics.  Various  features  of  this 
model  will  be  extracted  separately  from  several  actual  SIMD 
machines  that  have  been  built. 

The time complexity analysis of parallel machines
involves two factors: the internal computational speed of each
processing element (PE) and the communication speed at which
data moves between the PEs.

The computational speed of each PE depends on the
complexity of the hardware circuitry built into it. Computer
vision applications have a tremendous amount of data to
process, which forces the designer to make each PE as
simple as possible so that more PEs can be accommodated.
Nevertheless, we need good computational capability as well,
such as the floating point calculations required in this
application.

One PE design that attempts to meet these
computational requirements is that of the Massively Parallel
Processor (MPP). For the 128 x 128 square mesh, actual
execution speeds of 470 million addition operations per second,
291 million multiplication operations per second, and 165
million division operations per second have been reported for the
32-bit floating point data format in [Gilm 83, p. 166]. We
converted these numbers into equivalent machine cycles in
Table 3-1, using the 100 nanosecond machine cycle time of the
MPP.

In the MPP, NON-VON and the Connection Machine, mesh
communication instructions, which handle local
communications, execute in a single machine cycle. We assumed
a single bit data path between PEs, and that a floating point
datum is moved from the memory of one PE to the memory of
another PE. For the transfer of each bit, we assumed a 5 machine
cycle sequence: read a bit, send it through the mesh connection,
and then write it. Therefore, moving a 32-bit floating point
datum takes 160 machine cycles.




For global communications, we assumed the tree topology
of NON-VON. We assumed that the instructions which carry
out the tree communications between adjacent levels execute in
2 machine cycles, following the experience of the NON-VON chip
and prototype system design and implementation efforts. We
again assumed a single bit data path between PEs, and that a
floating point datum is moved from the memory of one PE to the
memory of another PE. For the transfer of each bit, we assumed
a 6 machine cycle sequence: read a bit, send it through the tree
connection, and then write it. Therefore, moving a 32-bit
floating point datum takes 192 machine cycles.

In the Connection Machine, global communication is
handled by a Boolean n-cube topology. For the prototype with a
256 x 256 mesh, a communications bandwidth of 5.0 x 10^10
bits/second for the FFT pattern has been reported in [Hill
86, p. 72]. There is a single bit data path between chips, and it
takes 63 machine cycles to move a single bit to the cube
neighbor; this figure takes into account the 4 MHz clock and the
4,096 routers. The number is so large because messages are
delivered across each dimension in sequence during every petit
cycle in the Connection Machine. Moving a 32-bit floating point
datum takes 2,016 machine cycles.

Our  discussion  is  summarized  in  Table  3-1.  This  model  of 
SIMD  computation  will  be  used  throughout  this  paper. 
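
The conversions used in this section can be checked with a few lines of Python. This is a sketch under stated assumptions: the division rate is taken as 165 million operations per second, the value that reproduces the 993 cycles of Table 3-1, and the per-bit costs are 5 cycles for the mesh (160 / 32), 6 for the tree and 63 for the n-cube. The names are hypothetical, and the floating point results are unrounded, so they agree with the table only to within one cycle.

```python
PES = 128 * 128        # processing elements in the MPP mesh
CYCLE_S = 100e-9       # MPP machine cycle time in seconds
WORD_BITS = 32         # width of one floating point datum


def cycles_per_operation(ops_per_second):
    """Cycles one PE spends per operation, given the array-wide rate."""
    return PES / (ops_per_second * CYCLE_S)


def word_transfer_cycles(cycles_per_bit, bits=WORD_BITS):
    """Cycles to move one datum over a single bit data path."""
    return cycles_per_bit * bits


float_cycles = {
    "add": cycles_per_operation(470e6),
    "multiply": cycles_per_operation(291e6),
    "divide": cycles_per_operation(165e6),   # assumed rate, see lead-in
}
comm_cycles = {
    "mesh": word_transfer_cycles(5),
    "tree": word_transfer_cycles(6),
    "n-cube": word_transfer_cycles(63),
}

if __name__ == "__main__":
    for name, value in float_cycles.items():
        print(f"{name:9s}{value:8.1f} cycles")
    for name, value in comm_cycles.items():
        print(f"{name:9s}{value:8d} cycles per 32-bit datum")
```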


4.  Parallelization 


4.1.  Storage  Space  Analysis 

In the adaptive Chebyshev method, we need 16 1-bit flags
and 16 floating point numbers for the storage space of each PE.
Assuming 32 bits for the floating point number representation,
we need 528 bits per PE. In the conjugate gradient method, we
need 16 1-bit flags and 21 floating point numbers. Here, we
need 688 bits per PE.


4.2. Time Complexity Analysis

Suppose that we have an $s \times s$ square mesh at the leaf level
of the tree.

For the adaptive Chebyshev acceleration method, we show
the number of operations in local and global computations, and
then the totals for each operation, in Table 4-1. By local
computations we mean those which require internal
computations carried out inside a PE and optional mesh
communications with nearby neighbors. The computation of
two vectors, $\delta$ and $u$, falls into this group. The matrix
computation, $Gu^{(i)}$, is an embedded step of the computation of
the vector $\delta$. By global computations we mean those which
require tree communications for global summary and other
internal computations carried out prior to or during global
summary. The computation of three vector norms is put in this
group.

Under the column designated "TOTAL", the number of
operations is multiplied by the number of machine cycles
defined in Table 3-1. In summary, the total number
of machine cycles required for each iteration of the adaptive
Chebyshev acceleration method is given by
$4392 \times (\log_2 s) + 16453$. For the tree with a 128 x 128 square
mesh at the leaf level, the total number of machine cycles is 47197.


We show the result for the conjugate gradient method in
Table 4-2. The computation of four vectors and two matrix
computations fall into local computations. The matrix
computations, $Ax^{(i)}$ and $Ar^{(i)}$, are embedded steps of the
computation of the vector $r^{(i)}$ and the coefficient $c_i$, respectively.
The computation of four inner products to compute the two
coefficients, $c_i$ and $u_i$, is put in global computations.

The total number of machine cycles required for each
iteration of the conjugate gradient method is given by
$5856 \times (\log_2 s) + 32307$. For the tree with a 128 x 128 square
mesh at the leaf level, the total number of machine cycles is 73299.
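
The storage and cycle totals of this section follow directly from the operation counts and the per-operation costs of Table 3-1, as the following Python sketch shows. The dictionary layout and function names are assumptions made for the example; each entry holds the local count, the constant part of the global count, and the coefficient of $\log_2 s$ in the global count.

```python
CYCLES = {"add": 348, "mul": 563, "div": 993, "mesh": 160, "tree": 192}

# operation: (local count, global constant, global coefficient of log2 s)
CHEBYSHEV = {
    "add": (16, 0, 6), "mul": (11, 1, 0), "div": (1, 0, 0),
    "mesh": (16, 0, 0), "tree": (0, 3, 12),
}
CONJUGATE_GRADIENT = {
    "add": (30, 3, 8), "mul": (18, 5, 0), "div": (2, 0, 0),
    "mesh": (32, 0, 0), "tree": (0, 4, 16),
}


def cycles_per_iteration(counts, s):
    """Machine cycles per iteration for an s x s mesh (s a power of two)."""
    levels = s.bit_length() - 1          # exact log2 for a power of two
    total = 0
    for op, (local, const, per_level) in counts.items():
        total += (local + const + per_level * levels) * CYCLES[op]
    return total


def storage_bits(flags, floats, float_bits=32):
    """Bits of local memory needed per PE."""
    return flags + floats * float_bits


if __name__ == "__main__":
    print(cycles_per_iteration(CHEBYSHEV, 128))
    print(cycles_per_iteration(CONJUGATE_GRADIENT, 128))
    print(storage_bits(16, 16), storage_bits(16, 21))
```

In both methods the only terms that grow with the mesh size are the additions and tree transfers of the global summaries.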


5.  Simulation  Results 

We have rewritten the NON-VON simulator to handle
floating point operations. It has been rewritten in FORTRAN 77
and ported to an IBM 4381. It now handles up to a 128 x 128
square image.

In our simulation work, the synthetic image was a constant
depth plane ($\alpha = 1.0$). The shape of the boundary of the plane
was a square, with size 128 x 128. The depth constraints were
scattered randomly over the plane. The densities of the depth
constraints were 15%, 30%, and 50%.

The root mean square error (RMSE) at the $i$th iteration is
defined as follows:

\[ \mathrm{RMSE}_i = \Big( \Big( \sum_{j=1}^{n} [x_j^{(i)} - \alpha_j]^2 \Big) / n \Big)^{1/2}. \]
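
For reference, the RMSE defined above can be computed as in the following Python sketch; the function name and the use of plain lists for the depth vectors are assumptions made for the example.

```python
import math


def rmse(x, alpha):
    """Root mean square error between an iterate x and the solution alpha."""
    n = len(x)
    return math.sqrt(sum((xj - aj) ** 2 for xj, aj in zip(x, alpha)) / n)


if __name__ == "__main__":
    alpha = [1.0] * 4                 # constant depth plane
    print(rmse([0.0] * 4, alpha))     # an all-zero estimate
    print(rmse(alpha, alpha))         # the exact solution
```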

In Table 5-1, we show the number of iterations $i$ needed to
attain the specified RMSE. The results are tabulated side by side
for three different iteration methods: the conjugate gradient, the
adaptive Chebyshev acceleration, and the Gauss-Seidel method.
We observe that the conjugate gradient method performs best in
the sense that it takes the fewest iterations. The adaptive
Chebyshev acceleration method comes next and the Gauss-Seidel
method performs worst. We should also note that each step of the
iteration of the first two methods is completely parallelized, so
that overall execution is much faster than for the Gauss-Seidel
method, where the computation is done in serial fashion.
As the depth constraints become sparser, the depth interpolation
problem itself becomes inherently harder to solve and takes more
iterations. Even here, the degradation in the Gauss-Seidel
method turns out to be the worst.

We derived above that the total number of machine
cycles per iteration is 47197 for the adaptive Chebyshev
acceleration method and 73299 for the conjugate gradient
method. In Table 5-2, we show the normalized
number of iterations for the two methods. For the adaptive
Chebyshev acceleration method, we use the same numbers as in
Table 5-1. For the conjugate gradient method, we have
multiplied the number of iterations in Table 5-1 by 73299 /
47197 = 1.5530. After normalization, the conjugate gradient
method still performs better than the adaptive Chebyshev
acceleration method. This is in part due to the errors in the
initial estimates of the eigenvalues. In the first few iterations,
the conjugate gradient method performs much better, but as more
computations are done the estimate of the eigenvalues (in this
case, the largest eigenvalue only) gets better. Thus, the
relative difference in the number of iterations between the
conjugate gradient method and the adaptive Chebyshev
acceleration method gets smaller.
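
The normalization described above is a single scaling by the ratio of the per-iteration cycle counts. The Python sketch below applies it to the iteration counts reported for 50% density at the loosest and tightest RMSE targets (5 and 70 for the conjugate gradient method, 27 and 135 for the adaptive Chebyshev acceleration method); the function and constant names are assumptions made for the example.

```python
CG_CYCLES = 73299        # machine cycles per conjugate gradient iteration
CHEB_CYCLES = 47197      # machine cycles per adaptive Chebyshev iteration
RATIO = CG_CYCLES / CHEB_CYCLES


def normalized_cg_iterations(iterations):
    """Conjugate gradient iterations expressed in Chebyshev-iteration units."""
    return round(iterations * RATIO, 1)


def total_cycles(iterations, cycles_per_iteration):
    """Machine cycles spent to reach a given RMSE."""
    return iterations * cycles_per_iteration


if __name__ == "__main__":
    for cg_iters, cheb_iters in [(5, 27), (70, 135)]:
        print(normalized_cg_iterations(cg_iters), "vs", cheb_iters)
    print(total_cycles(70, CG_CYCLES), total_cycles(135, CHEB_CYCLES))
```

The ratio of Chebyshev to normalized conjugate gradient iterations falls from about 3.5 at an RMSE of 0.5 to about 1.2 at an RMSE of 0.001, which is the narrowing gap the text describes.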




We have summarized other numerical values in Table 5-3.
There, the measures are listed for three different densities of
depth constraints for each iteration method. They are the
number of iterations; the minimum, the maximum, and the
average of the final depth values; and the 2-norm of the error
vector. Furthermore, we have listed $\|A^{1/2}(x^{(i)} - \alpha)\|$ for the
conjugate gradient method and the final estimate of the largest
eigenvalue, $M_E$, for the adaptive Chebyshev acceleration
method.

When we examine the final depth values, they show
another aspect of the depth interpolation problem becoming
harder as the depth constraints become sparser. With
comparable or sometimes better average values, the minimum
and the maximum values deviate further from the solution. This
phenomenon is common to all three iteration methods. For
example, for the conjugate gradient method, we have smaller
values of $\|A^{1/2}(x^{(i)} - \alpha)\|$, the quantity being minimized in this
method, as the depth constraints become sparser. Nevertheless,
we have comparable 2-norms of the error vector and comparable
average values, and worse minimum and maximum values. In the
adaptive Chebyshev acceleration method, the initial estimates of
the smallest and the largest eigenvalues were $-\|G\|_\infty$ and 0.0,
respectively.


6.  Conclusion  and  Future  Plans 

In  this  paper,  we  showed  how  the  conjugate  gradient 
method  can  be  applied  to  SIMD  machines  for  those  computer 
vision  problems  where  the  resulting  matrix  is  SPD. 

We  have  analyzed  the  space  and  time  complexity  of  two 
iteration  methods  based  on  our  SIMD  model  derived  from  actual 
machines  built.  In  particular,  we  analyzed  the  computational  and 
the  communication  costs  of  parallel  computing.  Also,  we  have 
analyzed  two  modes  of  communications,  local  and  global, 
necessary  for  local  interactions  and  global  summary, 
respectively. 

In this paper, results for a synthetic image with constant
depth were described. Other synthetic images, such as a sphere or
a cylinder, will be pursued in the future. Also, we plan to run
our algorithms on real images, such as range data from a scanner.


Acknowledgements 

The author is grateful to J. R. Kender, G. W. Wasilkowski,
and D. E. Shaw for their advice and support during this work.


References 

[Choi 85] Choi, D. J. and Kender, J. R.
Solving the Depth Interpolation Problem with the Adaptive
Chebyshev Acceleration Method on a Parallel Computer.
In Image Understanding Workshop, pages 219-223. 1985.

[Gilm 83] Gilmore, P. A.
The Massively Parallel Processor (MPP): A Large Scale SIMD
Processor.
In Proceedings of SPIE, pages 166-174. 1983.

[Grim 81] Grimson, W. E. L.
From Images to Surfaces.
MIT Press, 1981.

[Hill 86] Hillis, W. D.
The Connection Machine.
MIT Press, 1986.

[Lee 85] Lee, D.
Contributions to Information-based Complexity, Image
Understanding and Logic Circuit Design.
PhD thesis, Columbia University, October, 1985.

[Shaw 84] Shaw, D. E.
Organization and Operation of a Massively Parallel Machine.
Technical Report, Columbia University, October, 1984.

[Terz 84] Terzopoulos, D.
Multiresolution Computation of Visible-surface Representations.
PhD thesis, Massachusetts Institute of Technology, January, 1984.

[Wozn 80] Wozniakowski, H.
Roundoff-Error Analysis of a New Class of Conjugate-Gradient
Algorithms.
Linear Algebra and its Applications, 29:507-529, 1980.

[Youn 81] Young, D. M. and Hageman, L. A.
Applied Iterative Methods.
Academic Press, 1981.


Operations               Data Type           Execution speed
                                             (machine cycles)
-------------------------------------------------------------
Addition, Subtraction    32-bit fl. point    348
Multiplication           32-bit fl. point    563
Division                 32-bit fl. point    993
Mesh Communication       1 bit               1
Mesh Communication       32-bit fl. point    160
Tree Communication       1 bit               2
Tree Communication       32-bit fl. point    192

Table 3-1: Speed of Typical Operations




Operations               local    global            TOTAL (machine cycles)
---------------------------------------------------------------------------
Addition, Subtraction    16       6 (log s)         (6 (log s) + 16) x 348
Multiplication           11       1                 12 x 563
Division                 1        0                 1 x 993
Mesh Communication       16       0                 16 x 160
Tree Communication       0        12 (log s) + 3    (12 (log s) + 3) x 192

Table 4-1: Summary of Operations (Chebyshev Accel. Method)

Operations               local    global            TOTAL (machine cycles)
---------------------------------------------------------------------------
Addition, Subtraction    30       8 (log s) + 3     (8 (log s) + 33) x 348
Multiplication           18       5                 23 x 563
Division                 2        0                 2 x 993
Mesh Communication       32       0                 32 x 160
Tree Communication       0        16 (log s) + 4    (16 (log s) + 4) x 192

Table 4-2: Summary of Operations (Conjugate Gradient Method)


          Conjugate Grad.       Chebyshev Accel.      Gauss-Seidel
RMSE      50%    30%    15%     50%    30%    15%     50%    30%    15%
-------------------------------------------------------------------------
0.5       5      8      15      27     35     54      29     51     111
0.2       12     20     36      41     59     98      72     130    295
0.1       20     31     54      55     79     128     106    195    459
0.05      27     42     73      65     93     157     141    264    647
0.02      37     57     103     79     120    206     190    363    949
0.01      45     69     129     94     136    250     228    444    1230
0.005     52     81     152     104    160    296     268    530    1554
0.002     62     98     178     118    182    353     322    653    2023
0.001     70     109    199     135    207    383     364    751    2393

Table 5-1: Root Mean Square Error


          Conjugate Gradient         Chebyshev Accel.
RMSE      50%      30%      15%      50%    30%    15%
--------------------------------------------------------
0.5       7.8      12.4     23.3     27     35     54
0.2       18.6     31.1     55.9     41     59     98
0.1       31.1     48.1     83.9     55     79     128
0.05      41.9     65.2     113.4    65     93     157
0.02      57.5     88.5     160.0    79     120    206
0.01      69.9     107.2    200.3    94     136    250
0.005     80.8     125.8    236.1    104    160    296
0.002     96.3     152.2    276.4    118    182    353
0.001     108.7    169.3    309.1    135    207    383

Table 5-2: Root Mean Square Error (normalized)


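(d: density of depth constraints; i: number of iterations; x_min, x_max
and x_avg: minimum, maximum and average of the final depth values; all
vector quantities are taken at iteration i.)

Results from the Conjugate Gradient Method

d      i       x_min      x_max       x_avg      ||x - α||    ||A^(1/2)(x - α)||
---------------------------------------------------------------------------------
50%    70      .984603    1.002782    .999994    .126194      .0143727
30%    109     .980571    1.003009    .999988    .130898      .0096369
15%    199     .979287    1.007600    .999997    .125836      .0053095

Results from the Adaptive Chebyshev Acceleration Method

d      i       x_min      x_max       x_avg      ||x - α||    M_E
------------------------------------------------------------------------
50%    135     .985364    1.000801    .999687    .124797      .991694
30%    207     .979400    1.000410    .999826    .126349      .996296
15%    383     .971733    1.001309    .999929    .127170      .998905

Results from the Gauss-Seidel Method

d      i       x_min      x_max       x_avg      ||x - α||
------------------------------------------------------------
50%    364     .988639    1.000352    .999474    .127688
30%    751     .981034    1.000412    .999687    .128378
15%    2393    .971534    1.001434    .999900    .127979

Table 5-3: Other Results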




SURVEY  OF  PARALLEL  COMPUTERS* 

Hong  Seh  Lim  and  Thomas  O.  Binford 


Artificial  Intelligence  Laboratory,  Computer  Science  Department, 
Stanford  University,  Stanford,  CA  94305,  U.S.A. 


Abstract 

Computer vision systems process large volumes of data
and require huge computing power. This renders
conventional sequential computers I/O bound or compute
bound. Computers with parallel processing capability
are designed to address these problems. We survey
some of these parallel computers, both general purpose
and those oriented towards computer vision. Some
of the surveyed computers are in production and can
be purchased from the manufacturers; others are in
various stages of development in research laboratories.
Rather than describing each computer completely, this
survey emphasizes special features associated with each
computer. Tables of comparisons of the surveyed
computers are given.

1  Introduction 

Parallel processing has been successfully used to reduce
the time of computation for a wide variety of
applications. The processing of large amounts of data,
the need for real time computation, the use of
computationally expensive operations, and other demands
that would make a task too time-consuming to perform
on conventional sequential computers have motivated
computer architects to consider parallel computer
design. To help readers understand different kinds of
parallel computers, we explore various parallel computers
available by examining special features of each
computer.

This paper is divided into three main parts. The first
part discusses some general considerations involved in
the design of parallel processing, including
architectures and interconnection networks. The second part
outlines the main features of the surveyed computers.
They are arranged in alphabetical order by the name
of the computer. An appendix compares the surveyed
computers by number of processors, processing power,
memory size, architecture, availability, chip used, and
software environment.

*This work was supported by the Defense Advanced Research
Projects Agency under contract number N00039-84-C-0211.

The DARPA image understanding architecture
workshop (1986) provided detailed information on the
performance of some of the following systems on IU
"benchmark" problems.

2  General  Considerations 

2.1  Architectures 

Most computer architectures for parallel processing can
be classified as either single instruction multiple data
(SIMD) structures or multiple instruction multiple
data (MIMD) structures. A few computers have
multiple SIMD units that can operate independently, and
they are classified as multiple SIMD (MSIMD)
structures.

A SIMD computer typically consists of N processing
units, an interconnection network, and a control unit.
Each processing unit has its own processor and
memory. The interconnection network provides
interprocessor communication. The control unit broadcasts
instructions to the processors. Each active processor
executes the same instruction simultaneously, using data
taken from its own memory.
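
The SIMD execution model just described can be illustrated with a toy Python sketch: one broadcast instruction is applied by every active processing unit to its own local data, while inactive units leave their data unchanged. The function name and the representation of local memories and activity flags as lists are assumptions made for the example, and the loop is of course serial.

```python
def simd_step(instruction, memories, active):
    """Apply one broadcast instruction in every active processing unit."""
    return [instruction(value) if enabled else value
            for value, enabled in zip(memories, active)]


if __name__ == "__main__":
    memories = [1, 2, 3, 4]                # one local value per processing unit
    active = [True, False, True, True]     # activity flag per processing unit
    print(simd_step(lambda v: v * 2, memories, active))
```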

A MIMD computer consists of a set of independent
processing units and an interconnection network. Each
processing unit has an independent instruction stream
and operates on independent data. The interconnection
network provides communication between processors.

2.2  Interconnection  Networks 

The interconnection network is crucial to running
parallel computers efficiently. We discuss some of the most
commonly used networks in parallel computers. They
include cross-bar, pipelined, bus, K-dimensional
hypercube, Omega, and grid networks.

A cross-bar network allows any processor to access
data from all other processors. Thus, there is never
contention for communications resources. This scheme
requires $N^2$ switches to connect $N$ processors.
However, it becomes too costly to implement when the
number of processors is large.

The pipeline or one-dimensional systolic array
network is a simple one-dimensional nearest-neighbor
connection. The function of each processing unit is
specified by the host according to a specific problem or class
of problems. Once set up, each processing unit
performs the same operation on every datum sequentially.
The cost of this network is proportional to the number
of processors.

A bus network uses a common shared-data channel
for communication. With a bus, any processor may
transmit data to other processors through the common
data channel. However, since only one processor can
transmit at a time, bus contention becomes the major
problem of this network. Cache and local memories
decrease the load on a bus and extend its utility to more
processors. Bus-oriented architectures are prominent in
commercial systems.

A K-dimensional hypercube network connects $N = 2^K$
processors. Processors can be viewed as corners
of a cube in K-dimensional space with connections as
edges of the cube. This network has a high data
bandwidth, which grows as $N \log N$, and a low message
latency whose worst case is $\log N$. This network is
adaptable to other topologies, such as ring, tree, and two-
and three-dimensional meshes.
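
The hypercube properties above have a compact description in terms of binary processor addresses: two processors are connected when their addresses differ in exactly one bit, and the number of hops between two processors is the number of differing bits. The following Python sketch illustrates this; the function names are assumptions made for the example.

```python
def cube_neighbours(node, k):
    """Addresses directly connected to `node` in a k-dimensional hypercube."""
    return [node ^ (1 << d) for d in range(k)]


def cube_hops(src, dst):
    """Minimum number of links between two nodes (Hamming distance)."""
    return bin(src ^ dst).count("1")


def cube_links(k):
    """Total number of links in a k-dimensional hypercube."""
    n = 1 << k
    return n * k // 2


if __name__ == "__main__":
    k = 3
    print(cube_neighbours(0, k))
    print(max(cube_hops(0, dst) for dst in range(1 << k)))
    print(cube_links(k))
```

With $N = 2^K$ nodes the link count is $(N \log_2 N)/2$, which matches the stated $N \log N$ growth of the bandwidth, and the worst-case hop count is $K = \log_2 N$.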

An $N \times N$ Omega network consists of $\log N$ identical
stages. Between adjacent stages is a perfect-shuffle
interconnection. Each stage has $N/2$ switch boxes under
independent box control. The bandwidth is linear in $N$
and the latency is logarithmic in $N$.
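
A common way to route through an Omega network is destination-tag routing: at each stage the message passes through the perfect shuffle (a left rotation of its position bits) and the switch box then selects its upper or lower output according to one bit of the destination address. The Python sketch below illustrates this under the assumption that $N$ is a power of two and that a shuffle precedes every stage; the function name is hypothetical and switch conflicts between simultaneous messages are not modeled.

```python
def omega_route(src, dst, n_bits):
    """Positions visited by one message in an Omega network of 2**n_bits ports."""
    mask = (1 << n_bits) - 1
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # Perfect shuffle: rotate the position bits left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # Switch box: pick the output given by the next destination bit,
        # most significant bit first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


if __name__ == "__main__":
    print(omega_route(5, 2, 3))
```

Every message traverses exactly $\log_2 N$ stages, which is the logarithmic latency stated above.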

A two-dimensional nearest-neighbor network allows
communication between nearest neighbors arranged in
a grid. It is particularly suitable for SIMD architecture.
However, communication between distant processors is
expensive (proportional to the square root of the
number of processors). The complexity of the network
increases in proportion to the number of
processors. This is a common architecture, augmented in
some machines by other interconnections for distant
processors.
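
To put the scaling statements of this section side by side, the following Python sketch evaluates switch counts and worst-case distances for a given number of processors. It is an illustration under stated assumptions: $N$ is a power of four (so that it is both a power of two and a perfect square), the mesh has 4-neighbor connections without wraparound, and the function names are hypothetical.

```python
import math


def log2_exact(n):
    """log2 of a power of two, computed without floating point."""
    return n.bit_length() - 1


def crossbar_switches(n):
    """Switches needed by an N x N cross-bar."""
    return n * n


def omega_switch_boxes(n):
    """log2(N) stages of N/2 switch boxes each."""
    return (n // 2) * log2_exact(n)


def hypercube_worst_case_hops(n):
    """Worst-case latency of the hypercube, in links."""
    return log2_exact(n)


def mesh_worst_case_hops(n):
    """Corner-to-corner distance on a sqrt(N) x sqrt(N) grid."""
    side = math.isqrt(n)
    return 2 * (side - 1)


if __name__ == "__main__":
    n = 128 * 128
    print(crossbar_switches(n), omega_switch_boxes(n))
    print(hypercube_worst_case_hops(n), mesh_worst_case_hops(n))
```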

3  Surveyed  Computers 

3.1  Balance 

The Balance 21000 parallel computer system is a
MIMD system offered by Sequent Computer Systems
Inc. It provides from four to thirty tightly-coupled
32-bit microprocessors, with performance ranging
from 2.8 to 21 MIPS. Up to 256 users can be supported
simultaneously.

National NS32032 microprocessors are used. Each is
supported by a floating-point unit, a memory
management unit, and an 8-Kbyte cache. The cache, together
with its interrupt control logic, minimizes bus
contention among processors. This results in linear
speedups on the system as additional processors are added.
The whole system can deliver up to 2.25 MFLOPS.

All processors share a common memory pool and a
single copy of DYNIX, a parallel version of the UNIX
operating system. The memory is interleaved and
contains up to 48 Mbytes. It is equally accessible by all
processors and peripherals.

A proprietary 32-bit system bus connects all
processors and memories. It can sustain an effective data
transfer rate of up to 26.7 Mbyte/second. The I/O
subsystems can be Ethernet or Multibus interfaces.

The DYNIX operating system dynamically balances
the load of the system and distributes multiuser
workloads evenly across processors for maximum system
throughput. It allows processes to access mutually
exclusive hardware directly.

Sequent  has  a  software  library  that  supports  parallel 
programming  in  C,  FORTRAN,  Pascal,  and  Ada. 

3.2  Butterfly 

The Butterfly Parallel Processing Computer is a
product of Bolt Beranek and Newman Inc. The basic
configuration of this MIMD computer starts with four
processors; it can be expanded in single processor
increments to 256 processors, which yield up to 250 MIPS.
All processors are interconnected by a butterfly switch,
described below.

Each  processor  and  its  memory  are  located  on  a  sin¬ 
gle  board  called  a  Processor  Node  (PN),  which  is  the 
basic  computing  element  of  the  Butterfly  computer. 
Each  PN  uses  a  standard  Motorola  MC68020  micro¬ 
processor,  capable  of  executing  one  million  instructions 
per  second.  The  microprocessor  is  augmented  by  the 
MC68881  floating  point  computational  unit.  Each  PN 
also contains a microcoded control processor that provides
interprocessor communication, synchronization,
and support for parallel processing.

Each  node  may  have  memory  size  from  one  Mbyte 
to  four  Mbytes,  with  a  maximum  system-wide  shared 
memory  of  one  Gbyte.  It  can  independently  execute 
its  own  sequence  of  instructions,  referencing  data  as 
specified  by  the  instructions.  Though  memory  is  local 
to  the  PNs,  each  processor  can  access  remotely  any 
memory  in  the  system  using  the  Butterfly  switch. 

The Butterfly switch implements interprocessor communication
using techniques similar to packet switching.
Its topology is similar to that of the Fast Fourier
Transform butterfly, and it implements a subset of
the hypercube network. The switch has a latency of
log N, and the bandwidth through each processor-to-processor
path in the switch is 32 Mbits/second.
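
The logarithmic latency can be illustrated with a small sketch of
destination-tag routing through a butterfly network. It assumes 2 x 2
switching elements and a power-of-two number of rows; the function name
and the bit ordering are illustrative choices and are not taken from the
Butterfly documentation.

```python
def butterfly_route(source, dest, stages):
    """Return the row visited after each stage of a 2 x 2 butterfly network.

    Assumption: the network has 2**stages rows, and stage i may either keep
    or flip bit (stages - 1 - i) of the current row number.
    """
    size = 1 << stages
    if not (0 <= source < size and 0 <= dest < size):
        raise ValueError("source and dest must lie in [0, 2**stages)")
    row = source
    path = [row]
    for i in range(stages):
        bit = 1 << (stages - 1 - i)
        # Copy this bit of the destination address into the current row.
        row = (row & ~bit) | (dest & bit)
        path.append(row)
    return path
```

Under these assumptions a 256-row network needs eight stages, so every
message crosses log2(256) = 8 switching elements regardless of its source
and destination.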

The  Butterfly  I/O  system  is  distributed  among  the 
PNs.  Any  node  can  have  an  I/O  board  that  supports 
data  transfer  at  a  maximum  rate  of  two  Mbytes/second. 

Butterfly  application  software  development  is  done 
on  a  VAX  or  Sun  work  station  running  Berkeley  4.2 
UNIX. The system supports C and FORTRAN 77.
A multiprocessing Common Lisp system that will be
compatible with the Symbolics 3600 series of Lisp
workstations is under development.

3.3  CAPP 

The Content Addressable Parallel Processor (CAPP)
is a SIMD computer built at the University of Massachusetts.
Its goal is to explore the applications of
content  addressability  and  parallelism.  The  computer 
consists  of  two  main  parts — the  central  control  unit  and 
the  parallel  processor. 

The  central  control  unit  is  responsible  for  broadcast¬ 
ing  instructions  to  the  parallel  processor  and  for  load¬ 
ing  and  unloading  data  in  the  processor.  It  also  serves 
as  an  interface  between  the  host  and  secondary  data 
storage  devices.  The  central  control  unit  is  pipelined, 
and  instructions  can  be  prefetched.  It  contains  a  ROM 
with  commonly  needed  micro-coded  instructions  and  a 
small  program  memory  for  storing  user  programs. 

The  parallel  processor  contains  262,144  cells  ar¬ 
ranged  as  a  512  x  512  array.  Each  cell  consists  of  32 
bits  of  static  memory,  an  ALU,  and  four  one-bit  static 
tags.  One  of  the  tags  controls  whether  the  cell  is  active. 

Cells  are  connected  to  their  nearest  four  neighbors. 
Because  the  pin  count  on  the  chip  is  limited,  an  8:1 
multiplexing scheme was used for communication between
chips. Cells on the edge can be treated in three
ways: as dead edges connected to no neighbor, wrapped
around, or wrapped around but offset by one cell
to make a linear array.
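
The three edge treatments can be sketched for the eastward direction as
follows. This is a minimal illustration, not the CAPP hardware logic: the
mode names are hypothetical, and the assumption that the offset wrap moves
down by one row (and closes back to row 0 after the last row) is ours.

```python
ROWS = COLS = 512  # array size given in the text

def east_neighbor(row, col, mode):
    """Return the (row, col) of the cell to the east, or None at a dead edge.

    mode is one of "dead", "wrap", or "linear" (hypothetical names).
    """
    if mode not in ("dead", "wrap", "linear"):
        raise ValueError("unknown edge mode: %r" % (mode,))
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("cell is outside the array")
    if col < COLS - 1:
        return (row, col + 1)          # interior cell: plain mesh link
    if mode == "dead":
        return None                    # edge cell has no eastern neighbor
    if mode == "wrap":
        return (row, 0)                # same row, opposite edge
    # "linear": wrap to the start of the next row, chaining rows end to end
    return ((row + 1) % ROWS, 0)
```

In "linear" mode, following east links from cell (0, 0) visits every cell
of the array in row-major order, which is what turns the mesh into a
linear array.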

3.4  CLIP 

Cellular  Logic  array  for  Image  Processing,  CLIP,  is  a 
family  of  array  processors  developed  at  University  Col¬ 
lege,  London.  CLIP7  is  the  most  recent  system  con¬ 
structed  in  the  CLIP  family.  It  has  improved  image 
resolution  to  512  x  512  pixels  as  compared  to  96  x  96 
pixels  in  CLIP4.  CLIP7  is  implemented  as  a  scanned 
array  of  512  x  4  processors.  Each  processor  deals  with 
data  from  128  pixels  during  scanning. 
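
The figure of 128 pixels per processor follows directly from the image
and array sizes. A minimal check, assuming the image is divided evenly
among the processors (the function name is illustrative):

```python
def pixels_per_processor(image_shape, array_shape):
    """Pixels each processor handles when the image is split evenly."""
    image_pixels = image_shape[0] * image_shape[1]
    processors = array_shape[0] * array_shape[1]
    if image_pixels % processors != 0:
        raise ValueError("image does not divide evenly among processors")
    return image_pixels // processors

# 512 x 512 image scanned by a 512 x 4 processor array
clip7_share = pixels_per_processor((512, 512), (512, 4))
```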


The original CLIP4 chip has a bit-serial circuit that
is flexible but offers little processing power per
element. CLIP7 is based on a custom-designed integrated
circuit, which includes a 16-bit ALU running at
100 ns/cycle. This multibit circuit is more appropriate
than the bit-serial circuit for image processing on
grey-scale data.

Each  processor  has  128  x  256  bits  of  storage,  orga¬ 
nized  as  4,096  bytes.  A  local  register  allows  the  con¬ 
struction  of  a  system  in  which  each  processor  in  the 
array  has  a  degree  of  local  autonomy.  Externally  ap¬ 
plied  signals  control  normal  chip  operation.  However, 
the  content  of  a  local  register  can  affect  operation  if 
conditional  operation  is  selected. 

The  CLIP7  chip  has  a  single  processor.  All  data  lines 
are  available  externally,  in  particular  the  neighborhood 
interconnection  lines.  This  means  it  is  possible  to  build 
assemblies  of  processors  having  various  connection  net¬ 
works. 

Data  buses  in  the  chip  are  16  bits  wide,  whereas  ex¬ 
ternal  data  accesses  are  through  8-bit  ports.  Neigh¬ 
borhood  interconnections  between  processors  use  single 
lines  because  of  packaging  requirements.  Serial  trans¬ 
mission  of  propagation  data  imposes  a  10 

All  user  interactions  with  the  CLIP7  system  are 
through  the  minicomputer  host,  which  runs  the  UNIX 
operating  system.  The  host  supports  IPO  and  C  pro¬ 
gramming  languages.  A  control  pipeline  receives  infor¬ 
mation  from  the  host.  Pipelining  permits  overlapped 
fetching  and  execution  of  instructions. 

CLIP7  is  provided  with  a  direct  video  data  input 
channel  via  TV  camera,  A/D  converter,  and  a  frame 
store.  It  is  connected  to  a  Winchester  disc  for  data 
storage. Its effective bandwidth is between 20 and 40
Mbytes/second, which allows an image to be stored or
retrieved in 100 ms.

3.5  Connection  Machine 

Thinking  Machines  Corp.  builds  the  Connection  Ma¬ 
chine  computer,  which  has  65,536  processors  providing 
a  raw  computing  power  of  1000  MIPS.  The  computer 
can  be  configured  to  a  much  larger  number  of  logical 
processors  by  creating  a  two-dimensional  array  of  vir¬ 
tual  processors  on  each  physical  processor. 

The computer can be divided into four equal subparts.
Each works with a separate front end, and each has
a separate instruction stream executing in a quarter of
the system's processors. Thus it becomes an MSIMD
computer.

Each  processor  is  a  bit-serial  processor  that  operates 
on  two  data  values  specified  by  each  instruction.  It  has 
four  Kbits  of  memory  (32  Mbytes  for  the  computer). 
The memory is divided into a data area and a stack
area, and it is bit addressable. Sixteen physical processors
are grouped on a chip.

Two  ways  of  communication  between  processors  are 
available.  First,  each  processor  is  wired  to  its  neigh¬ 
bors  to  the  North,  East,  West,  and  South  by  the 
NEWS network. Second, a "Boolean n-cube" network,
the  Connection  Machine  Router,  connects  each  of  the 
65,536  physical  processors  to  16  other  physical  proces¬ 
sors  whose  binary  addresses  are  different  in  just  one  of 
the  16  bits.  The  Router  allows  a  full  message  to  be  sent 
from  any  processor  to  any  other.  The  sender  processor 
simply  needs  to  know  the  address  of  the  destination 
processor. 
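
The n-cube wiring rule can be written in one line: the neighbors of a
processor are the addresses obtained by flipping each address bit in turn.
The sketch below is only an illustration of that rule; the function name
is hypothetical.

```python
def hypercube_neighbors(address, dimensions=16):
    """Addresses that differ from `address` in exactly one bit."""
    if not 0 <= address < (1 << dimensions):
        raise ValueError("address does not fit in the given dimensions")
    return [address ^ (1 << bit) for bit in range(dimensions)]
```

With 16 address bits each of the 65,536 processors has 16 neighbors, and
any two processors are at most 16 links apart, because a message can
correct one differing address bit per hop.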

Data are exchanged between memory and the front
end in three ways. "Read-slice" reads a single bit from
the memory of each of a series of consecutive processors.
"Read-processor" moves a single field between the
front end and a single processor. "Read-array" moves
fields between the front end and a set of contiguous
processors.

A  microcontroller  expands  macro-instructions  from 
the  front  end  into  nano-instructions,  which  are  broad¬ 
cast  to  all  virtual  processors.  Processors  have  the  op¬ 
tion  of  "sitting  out"  some  instructions  depending  on 
the  one-bit  Context  Flag  in  each  processor. 
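
The effect of the Context Flag can be sketched as a masked broadcast. This
is a simplified model, assuming that a set flag means the processor takes
part in the instruction; the function name and the use of a Python
callable as the "instruction" are illustrative only.

```python
def broadcast(instruction, memories, context_flags):
    """Apply one instruction to every processor whose context flag is set.

    Processors with a cleared flag "sit out" and keep their old value.
    """
    if len(memories) != len(context_flags):
        raise ValueError("one context flag is needed per processor")
    return [instruction(value) if active else value
            for value, active in zip(memories, context_flags)]
```

For example, `broadcast(lambda v: v + 1, [1, 2, 3, 4], [True, False, True, False])`
increments only the first and third values.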

The  Connection  Machine  provides  an  assembly-level 
language  REL-2.  It  also  supports  parallel  versions  of 
C  and  Lisp. 

3.6  FAIM-1 

FAIM-1 is a multiprocessing system developed to support
concurrent symbolic program development and execution.
It consists of many autonomous processing
elements,  called  Hectogons,  arranged  in  a  hexagonal 
mesh.  A  19-processor  prototype  computer  is  under  de¬ 
velopment  at  Schlumberger  Palo  Alto  Research. 

Each  Hectogon  has  six  subsystems  connected  by  the 
System  Bus  (SBUS).  The  Evaluation  Processor  (EP) 
is  a  stack-based  processor  responsible  for  evaluating 
computer  instructions.  The  Switching  Processor  (SP), 
which  is  responsible  for  context  switching,  sets  up  the 
next  runnable  task,  while  the  EP  evaluates  the  cur¬ 
rently  active  task.  The  Instruction  Stream  Memory 
(ISM)  stores,  decodes,  and  handles  instructions  to  the 
EP.  It  runs  concurrently  with  the  EP  and  effectively 
provides  a  two-stage  pipeline  mechanism.  The  Scratch 
Random  Access  Memory  (SRAM)  is  a  four-ported  lo¬ 
cal  data  memory,  which  provides  concurrent  access  to 
the  EP,  SP,  SBUS  and  Post  Office.  A  parallel  asso¬ 
ciative  memory  for  matching  structures  is  the  Pattern 
Addressable  Memory  (PAM)  which  stores  and  matches 
S-expression structures of symbols and words. Physical
delivery of inter-Hectogon messages is carried out
through the Post Office subsystem. It delivers messages
concurrently  with  program  execution  in  the  processor, 
detects  congestion  dynamically,  and  makes  its  routing 
decisions  accordingly. 

Each  Hectogon  communicates  by  sending  messages. 
Fault-tolerant  message  routing  is  made  possible  by  the 
multiplicity  of  paths  over  which  a  message  may  be 
routed  to  its  destination. 

The FAIM-1's communication topology consists of
two  levels.  The  lower  level  has  a  number  of  Hectogons 
interconnected  to  form  processing  surfaces.  These  sur¬ 
faces  in  turn  can  be  interconnected  to  form  a  multi- 
surface  configuration  on  the  higher  level. 

Within each surface in the lower level, the Hectogons
are arranged in a regular hexagonal mesh. When wires
leave a processing surface at the periphery, they are
folded  back  onto  the  surface  using  a  three-axis  variant 
of  a  twisted  torus.  This  wrapping  scheme  provides  a 
minimal  switching  diameter  for  a  hexagonal  mesh.  For 
a  19-processor  surface  the  worst-case  communication 
requires  at  most  two  hops. 

Multiple  surfaces  can  also  be  interconnected  using  a 
hexagonal mesh. Multi-surface configurations have the
advantage of increased on-surface locality. The communication
diameter of the entire system is decreased
when compared to a single-surface instance of a similar
number of processors. A 58,381-processor computer,
arranged on a hexagon of side 140, has a diameter
of 139. On the other hand, a 58,807-processor
computer,  arranged  on  a  9-surface  configuration,  each 
surface  forming  a  hexagon  of  side  10,  has  a  diameter 
of  89.  Thus,  using  a  multi-surface  configuration,  the 
communication  diameter  is  reduced  by  50. 
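
The processor counts quoted above match the number of nodes in a hexagonal
mesh with n nodes along each side, 3n(n - 1) + 1. The sketch below checks
this; the function name is illustrative, and reading 58,807 as a side-9
hexagon of side-10 surfaces is our inference from the arithmetic rather
than something the text states.

```python
def hexagon_size(side):
    """Number of nodes in a hexagonal mesh with `side` nodes per edge."""
    if side < 1:
        raise ValueError("side must be at least 1")
    return 3 * side * (side - 1) + 1

prototype = hexagon_size(3)            # 19-processor prototype surface
single_surface = hexagon_size(140)     # 58,381 processors on one surface
# Assumption: surfaces of side 10 arranged as a hexagon of side 9.
multi_surface = hexagon_size(9) * hexagon_size(10)
```

In both single-surface cases the quoted diameter equals side - 1 (two hops
for side 3, 139 for side 140), which is consistent with the wrapped mesh
described in the text.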

3.7  FX/series 

Alliant Computer Systems Corp. offers the FX/series
parallel processing system. It can be expanded in the
field to provide 94 MFLOPS peak performance.

The  architecture  of  the  FX-8  system  includes  two 
main  resource  classes — Interactive  Processors  (IPs)  and 
the  computational  complex. 

The  IP  runs  interactive  user  jobs  and  the  operating 
system  in  parallel.  It  maintains  system  responsiveness 
and  frees  the  computational  complex  to  concentrate  on 
compute-intensive  applications.  Each  IP  is  a  Multibus 
containing  a  virtual  memory  address  translation  unit, 
an  I/O  map,  and  both  a  console  and  a  remote  diagnos¬ 
tic  serial  port. 

The  IP  interfaces  with  the  IP  cache,  which  provides 
access  to  global  memory.  Each  IP  has  512  Kbytes  of 
local memory and accesses global memory through a
virtual memory architecture. Virtual address space for
users  is  two  Gbytes.  Up  to  12  IPs  can  be  configured  as 
interactive  load  increases. 

The  computational  complex  consists  of  up  to  eight 
processors,  called  Computational  Elements  (CEs). 
Each  processor  can  work  independently  and  delivers 
11.8  MFLOPS  peak  performance.  Because  the  hard¬ 
ware  schedules  and  synchronizes  multiple  CEs,  the  per¬ 
formance  speedup  delivered  to  a  single  application  ap¬ 
proaches  the  number  of  CEs  installed. 
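
The peak rating of a configuration follows from the per-CE figure, on the
assumption that peak performance simply adds across CEs. A minimal sketch
with illustrative names:

```python
CE_PEAK_MFLOPS = 11.8   # per-CE peak quoted in the text
MAX_CES = 8             # largest computational complex

def complex_peak_mflops(ces):
    """Aggregate peak MFLOPS, assuming the per-CE peaks add linearly."""
    if not 1 <= ces <= MAX_CES:
        raise ValueError("the computational complex holds 1 to 8 CEs")
    return ces * CE_PEAK_MFLOPS
```

A full eight-CE complex comes to 8 x 11.8 = 94.4 MFLOPS under this
assumption.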

The  CP  cache  serves  as  an  interleaved  high-speed 
physical  memory  buffer  for  the  computational  complex. 
Two  cache  modules  with  four  cache  ports  provide  a 
four-way  interleaved  128-Kbyte  cache  with  a  maximum 
bandwidth  of  376  Mbytes/second.  A  crossbar  intercon¬ 
nect  dynamically  connects  up  to  eight  computational 
elements  with  up  to  four  cache  ports. 

Physical  memory  in  Alliant  systems  uses  256  Kbytes 
of  dynamic  RAM  and  is  expandable  in  8-Mbyte  mod¬ 
ules  up  to  a  maximum  of  64  Mbytes. 

The  Concentrix  operating  system  implements  the 
Berkeley  4.2  UNIX  operating  system.  It  supports  a 
multiuser  environment  and  parallel  processing  without 
programmer intervention. The system manages two
types of jobs and dynamically schedules them on available
processors as long as work remains. Compute-intensive
jobs take priority on the computational complex.
Interactive user jobs, I/O, and other operating
system activities are scheduled on any available IP or
on an otherwise idle computational complex.

Concentrix  supports  FX/FORTRAN,  Pascal,  C,  and 
Alliant  assembler.  The  FX/FORTRAN  fully  imple¬ 
ments  the  FORTRAN  77  programming  language.  C 
and  Pascal  languages  are  supported  only  in  a  single 
computational  element. 

The  FX/FORTRAN  compiler  identifies  those  sec¬ 
tions  of  code  that  can  be  executed  concurrently  by 
multiple  CEs  during  compilation.  It  optimizes  a  loop  as 
much as possible, but suppresses optimization wherever
the optimized code might produce results that differ
from the unoptimized code. Thus, it requires no source
code  reprogramming  from  the  user.  The  compiler  in¬ 
serts  special  concurrency  control  instructions  into  the 
program  stream  to  identify  these  sections.  This  con¬ 
currency  is  self-scheduling  and  controlled  by  hardware 
at  execution  time.  Additional  CEs  can  be  added  with¬ 
out  recompiling  or  relinking  of  programs.  The  hard¬ 
ware  concurrency  control  allows  a  program  to  initiate, 
synchronize,  and  suspend  concurrent  processing  with 
minimum  overhead. 


3.8  GAPP 

The Geometric Array Parallel Processor, GAPP, was originally
developed by the Martin-Marietta Corp. and is
marketed commercially by NCR Corp.'s Microelectronics
Division. It is a SIMD processor with 6 x 12 processing
elements (PEs).

Each of the PEs contains a 1-bit ALU, 128 bits
of  RAM,  and  four  registers.  It  operates  with  a  10- 
MHz  clock  and  takes  25  clock  cycles  to  add  two  8-bit 
numbers.  With  all  the  PEs  running,  the  processor  can 
deliver  921  million  additions  per  second. 
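
The addition rate can be related to the clock and cycle counts given
above. The sketch assumes every PE performs one 8-bit addition every 25
cycles with no other overhead; the function name is illustrative.

```python
def additions_per_second(pes, clock_hz=10_000_000, cycles_per_add=25):
    """8-bit additions per second for `pes` PEs working in lockstep."""
    if pes < 0:
        raise ValueError("pes must be non-negative")
    return pes * clock_hz // cycles_per_add

one_chip = additions_per_second(72)          # a single 6 x 12 chip
thirty_two_chips = additions_per_second(72 * 32)
```

Under these assumptions one 72-PE chip performs 28.8 million additions per
second, and a rate of about 921 million corresponds to 2,304 PEs, for
example 32 chips. The text does not state the array size behind its
figure, so the 32-chip reading is an inference.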

72  PEs  are  arranged  in  a  6  x  12  array  in  the  current 
version of the chip. The chip is fabricated with a 3-µm
double-layer metal CMOS process, and it is currently
housed  in  a  ceramic  84-lead  pin-grid  array. 

Connecting  each  PE  to  its  neighbors  on  the  north, 
south,  east,  and  west  are  bidirectional  communication 
lines.  In  addition,  a  separate  I/O  communication  bus 
allows  data  to  be  input  from  the  south  end  of  the  ar¬ 
ray  and  output  to  the  north  without  interfering  with 
computations  within  the  ALU. 

The  implementation  of  a  control  store  lets  the  pro¬ 
cessor  receive  a  set  of  instructions  from  the  host  and 
store  them,  freeing  the  host  for  other  tasks.  The  control 
store  operates  in  conjunction  with  a  sequencer,  which 
watches  for  and  maintains  the  correct  sequence  as  the 
processor  performs  its  instructions. 

The processor is programmed with a sequence of instructions
that, when assembled, directs
the appropriate control signals to every cell in the array.
Up to five commands (one for each of the four
registers and one for the RAM) can be executed simultaneously
on every instruction cycle. A software library
of macro-cells forms the basis of a high-level command
set for the processor.

3.9  iPSC 

Intel  Scientific  Computers  offers  iPSC  as  a  multipro¬ 
cessor  system  that  can  operate  concurrently  with  as 
many  as  128  independent  processing  units  connected 
as  a  hypercube  network. 

The  iPSC  system  consists  of  two  major  functional 
elements — the  cube  and  the  cube  manager. 

Each  node  of  the  cube  is  a  board-level  micro¬ 
computer  with  high-speed  versions  of  the  Intel 
80286/80287  microcomputer  chip  sets.  It  has  its  own 
memory  of  512  Kbytes,  expandable  to  4.5  Mbytes. 
Each  node  also  contains  eight  bidirectional  communi¬ 
cation  channels  managed  by  dedicated  communication 
co-processors.  Seven  of  these  channels  are  physically 
linked  to  other  nodes  and  serve  as  dedicated  channels. 
The eighth channel is a global Ethernet channel that
provides direct access to and from the cube manager
for  program  loading,  data  I/O,  and  diagnostics.  It  has 
a  10-Mbit/second  channel  for  internode  serial  commu¬ 
nication. 

The  cube  manager  provides  a  high-level  interface.  It 
serves  as  the  local  host  for  the  cube  and  provides  the 
communication/control  software  and  the  system  diag¬ 
nostic  facility.  It  runs  the  XENIX  operating  system, 
a  version  of  UNIX,  with  Lisp,  FORTRAN,  C,  and  as¬ 
sembly  language. 

The  iPSC-VX  is  a  vector  concurrent  system  built 
upon  the  basic  architecture  of  iPSC.  It  couples  a  high- 
performance  vector  processor  to  each  iPSC  process¬ 
ing  node,  thus  yielding  a  peak  performance  of  1,280 
MFLOPS  (on  32-bit  data,  or  424  MFLOPS  on  64-bit 
data)  on  a  64-node  iPSC-VX/d6  version.  Each  node 
consists  of  1.5  Mbytes,  giving  the  whole  system  96 
Mbytes  of  memory. 
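
The iPSC figures can be cross-checked with a few lines of arithmetic. The
sketch assumes that the number of nodes in a hypercube is a power of two
and that "d6" denotes a six-dimensional cube; the helper name is
illustrative.

```python
def cube_dimension(nodes):
    """Dimension of a hypercube with `nodes` nodes (must be a power of two)."""
    dim = nodes.bit_length() - 1
    if nodes < 1 or (1 << dim) != nodes:
        raise ValueError("a hypercube has a power-of-two number of nodes")
    return dim

vx_nodes = 64
vx_dimension = cube_dimension(vx_nodes)      # dimension of the 64-node cube
vx_peak_per_node = 1280 / vx_nodes           # MFLOPS per node, 32-bit data
vx_total_memory = vx_nodes * 1.5             # Mbytes
largest_ipsc_dimension = cube_dimension(128) # links per node in 128-node cube
```

The last value is 7, which agrees with the seven dedicated node-to-node
channels on each board.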

3.10  Massively  Parallel  Processor 

The  Massively  Parallel  Processor  (MPP)  is  a  bit-serial 
SIMD  parallel  processing  computer  built  for  the  NASA 
Goddard  Space  Flight  Center  by  Goodyear  Aerospace 
Corp. It has 16,384 processing elements (PEs).

The  major  blocks  in  the  MPP  are  the  array  unit, 
array  control  unit,  staging  memory,  program  and  data 
management  unit,  and  interface  to  the  host. 

Logically,  the  array  unit  contains  16,384  PEs  ar¬ 
ranged  in  a  128  row  x  128  column  square  array.  Physi¬ 
cally,  the  array  unit  contains  an  extra  128  row  x  4  col¬ 
umn  rectangle  of  PEs  for  redundancy.  When  a  faulty 
PE  is  discovered,  the  processor  bypasses  all  the  PEs 
in  that  column  (or  row),  and  the  topology  is  not  dis¬ 
turbed. 

Each  PE  is  a  bit-serial  processor  which  uses  a  full 
adder  and  a  shift  register  for  arithmetic.  Each  PE  has 
a  RAM  storing  1,024  bits.  The  address  lines  of  all  PEs 
are  tied  together  so  that  memories  are  accessed  by  a 
bit-plane  with  one  bit  of  a  bit-plane  accessed  by  each 
PE.  The  PE  memory  can  be  expanded  to  65,536  bits 
per  PE  or  to  128  Mbytes  total.  The  basic  cycle  time 
of  the  PE  is  100ns.  With  all  PEs  operating  in  parallel, 
it  delivers  3,000  MOPS  for  integer  addition  and  400 
MFLOPS  for  floating  point  addition. 
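
Bit-serial arithmetic with a full adder can be sketched as follows: one
bit of each operand is consumed per cycle, least significant bit first,
and the carry is held between cycles. This is a generic illustration of
the technique, not a model of the MPP's actual PE datapath.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned integers one bit per cycle.

    Returns (sum modulo 2**width, final carry).
    """
    if a < 0 or b < 0 or a >= (1 << width) or b >= (1 << width):
        raise ValueError("operands must fit in `width` bits")
    carry = 0
    result = 0
    for cycle in range(width):
        x = (a >> cycle) & 1
        y = (b >> cycle) & 1
        total = x ^ y ^ carry                  # full-adder sum output
        carry = (x & y) | (carry & (x ^ y))    # full-adder carry output
        result |= total << cycle
    return result, carry
```

An n-bit addition therefore takes on the order of n cycles per PE, and the
machine's throughput comes from running all PEs at once.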

Each  PE  communicates  with  its  four  nearest  neigh¬ 
bors.  The  edge  connection  is  programmable.  Between 
the  top  and  bottom  edges  of  the  array  unit,  one  can  ei¬ 
ther  connect  them  together  to  make  the  array  look  like 
a  cylinder  or  separate  them  to  make  that  array  look 
like  a  plane.  Similarly,  the  left  and  right  edges  can  be 
independently  connected  together,  or  separated.  When 
the left and right edges are connected together, one can
either connect corresponding rows together or slide the
connection by one row so the left PE of row i communicates
with the right PE of row i+1. Thus rows are
connected together in a spiral fashion like a long linear
string.

The  array  control  unit  has  three  subunits — the  PE 
control  unit  (PCU)  controls  processing  in  the  array 
unit,  the  I/O  control  unit  manages  flow  of  I/O  data 
through  the  array  unit,  and  the  main  control  unit 
(MCU)  runs  application  programs,  performs  necessary 
scalar  processing,  and  controls  the  other  two  subunits. 
The  division  of  responsibility  allows  array  processing, 
scalar  processing,  and  I/O  to  proceed  simultaneously. 

The  staging  memory  buffers  and  reorders  the  bit- 
serial  format  of  the  array  unit  to  the  word  format  of 
the  outside  world.  Data  can  be  input  at  a  rate  of  160 
Mbytes/second.  An  I/O  rate  of  320  Mbytes/second  can 
be  achieved  when  input  and  output  proceed  simultane¬ 
ously. 

The  program  and  data  management  unit  controls  the 
overall  flow  of  program  and  data  in  and  out  of  the  MPP. 
It  also  handles  program  development  and  diagnostics. 
It  is  implemented  on  a  DEC  PDP-11  minicomputer 
operating under DEC's RSX-11M real-time multiprogramming
system.

The  software  consists  of  two  assemblers  (one  each  for 
the  PCU  and  the  MCU),  a  system  subroutine  library, 
a  set  of  I/O  macros,  a  control  and  debug  module,  and 
a  linker.  Additionally,  a  parallel  version  of  Pascal  is 
available. 

3.11  Multimax 

The Multimax multiprocessor computer system is offered
by Encore Computer Corp. The system can be
configured  to  use  from  2  to  20  processors.  Its  perfor¬ 
mance  ranges  between  1.5  and  15  MIPS. 

Each  Dual  Processor  Card  (DPC)  holds  two  indepen¬ 
dent  National  NS32032  processors  running  at  10  MHz. 
Each  processor  has  its  own  National  NS32081  floating 
point unit. The DPC includes a 32-Kbyte cache memory
unit which minimizes processor access latency to
data and instructions, and cuts bus traffic.

There  are  1  to  8  Shared  Memory  Cards  (SMCs) 
per  system.  Each  card  provides  from  4  Mbytes  to  16 
Mbytes  of  shared  memory.  Thus  system-wide  shared 
memory  ranges  from  4  to  128  Mbytes.  The  SMC  has 
two controllers, which serve the needs of processors in
parallel. Sequential data transfer into and out of memory
is sped up by eight-way interleaved memory
banks.

A  100-Mbyte/second  Nanobus,  which  uses  advanced 
Schottky technology and has a cycle time of 80 ns, is
provided. It supports the full-speed operation of all
processors.  The  Multimax  vector  bus  can  handle  burst 
rates  up  to  1  million  interrupts  per  second.  All  pro¬ 
cessors  are  available  to  respond  to  interrupts,  and  all 
processors  can  initiate  I/O. 
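
The quoted Nanobus bandwidth and cycle time together imply how much data
moves per bus cycle. A minimal check, assuming one transfer per cycle (the
bus width itself is not stated in the text, and the function name is
illustrative):

```python
def bytes_per_cycle(bandwidth_bytes_per_s, cycle_time_ns):
    """Bytes moved in each bus cycle at the given bandwidth."""
    return bandwidth_bytes_per_s * cycle_time_ns / 1_000_000_000

nanobus_bytes = bytes_per_cycle(100_000_000, 80)
```

The result is 8 bytes per cycle, which would be consistent with a 64-bit
data path under the one-transfer-per-cycle assumption.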

Multimax  supports  two  UNIX  versions — UNIX  4.2 
and  UNIX  V,  called  UMAX  4.2  and  UMAX  V  respec¬ 
tively.  These  operating  systems  are  configured  for  a 
multiprocessing  environment.  They  provide  users  with 
transparent  parallelism  in  a  time-sharing  environment. 
Multimax also supplies parallel extensions to standard
languages,  a  library  of  parallel  processing  system  calls, 
and  a  parallel  debugger. 

3.12  NON-VON 

NON-VON is a family of massively parallel computers
being developed at Columbia University. The NON-VON 1
and NON-VON 3 computers are SIMD computers.
The NON-VON 4 is an enhanced version designed
as an ensemble of one or more independent SIMD computers
communicating through a high-bandwidth interconnection
network. Thus, NON-VON 4 can operate
in MSIMD mode.

All members of the family include a primary processing
system (PPS), which consists of a large number of
small processing elements (PEs) configured as a binary
tree. Each PE has 64 bytes of local RAM and an I/O
switch. The lack of memory in the PE forces it to
load instructions from external sources. The design of
NON-VON also includes a secondary processing system
(SPS) connected to the PPS through a high-bandwidth
parallel interface. The SPS can inspect records "on the
fly" to determine whether they are relevant to an operation
before transferring them to the PPS. However, the SPS
has not been implemented because of funding limitations.

NON-VON 1 and NON-VON 3 include a control processor
(CP) at the root of the tree. The CP coordinates
activities within the PPS and broadcasts instructions
to be executed in all active PEs. On the other hand,
NON-VON 4 includes a number of large processing elements
(LPEs), each of which is capable of acting as a CP for
some portion of the PPS. Thus it can handle multitasking
and multi-user applications. Additional storage
is available for the LPEs in NON-VON 4. This will be useful
as swapping storage for the local RAMs in the PEs.

The  NON-VON  computers  allow  three  modes  of 
communications  between  the  PEs.  They  are  all  sup¬ 
ported  by  the  I/O  switch.  The  global  communication 
mode  allows  the  CP  to  broadcast  instructions  to  all 
PEs  and  each  PE  to  return  results  to  the  CP.  Data  can 
be transferred from one PE to another through the CP
using global communication mode, but no concurrency
is  achieved.  The  tree  communication  mode  allows  data 
transfer  among  PEs  that  are  physically  adjacent  within 
the  PPS  binary  tree.  Data  can  be  transferred  to  the 
parent,  left  child  and  right  child.  In  linear  communi¬ 
cation  mode,  the  tree  is  reconfigured  as  a  linear  array 
of  PEs.  Data  can  be  transferred  to  the  left  or  right 
neighbor. 
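
The tree communication mode can be sketched with the usual array
numbering of a complete binary tree. The numbering (root is 1, children of
i are 2i and 2i + 1) is an assumption made for illustration; the text does
not say how NON-VON addresses its PEs.

```python
def tree_neighbors(index, size):
    """Parent and children of PE `index` in a binary tree of `size` PEs."""
    if not 1 <= index <= size:
        raise ValueError("index must lie in 1..size")
    left, right = 2 * index, 2 * index + 1
    return {
        "parent": index // 2 if index > 1 else None,   # root has no parent
        "left": left if left <= size else None,        # leaves have no children
        "right": right if right <= size else None,
    }
```

Each PE thus has at most three tree neighbors, in line with the parent,
left child, and right child transfers described above.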


3.13  PASM 

PASM  is  a  partitionable  SIMD/MIMD  computer  at 
Purdue  University.  It  can  be  dynamically  reconfigured 
to  operate  as  one  or  more  independent  SIMD/MIMD 
computers  of  various  sizes.  The  design  of  PASM  calls 
for  N  =  1,024  processors.  A  16-processor  prototype 
based  on  Motorola  MC68000  processors  is  under  devel¬ 
opment. 

The  basic  components  of  PASM  consist  of  a  parallel 
computation  unit  (PCU),  microcontroller  (MC),  con¬ 
trol  storage,  memory  storage  system,  memory  manage¬ 
ment  system,  and  system  control  unit. 

The  PCU  is  designed  to  have  N  processors,  N  mem¬ 
ory  modules,  and  an  interconnection  network.  The 
PCU  processors  are  microprocessors  that  perform  the 
actual  computations.  The  PCU  memory  modules  are 
used  by  the  PCU  processors  for  data  storage  in  SIMD 
mode  and  both  data  and  instruction  storage  in  MIMD 
mode.  Each  memory  module  has  a  pair  of  memory 
units.  This  double  buffering  scheme  allows  data  to  be 
moved  between  one  memory  unit  and  secondary  stor¬ 
age  while  the  processor  operates  on  data  in  the  other 
memory  unit.  The  interconnection  network  provides 
communication  among  processors. 

The  MC  is  a  set  of  microprocessors  that  act  as  the 
control  units  for  the  PCU  processors  in  SIMD  mode 
and  coordinate  the  activities  of  the  PCU  processors  in 
MIMD mode. If Q, the number of MCs, is the maximum number of
partitions allowable, then N/Q is the size of the smallest
partition. Control storage contains the programs for the
MCs.
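
The relation between N, Q, and the smallest partition can be checked
numerically. The value Q = 32 used below is hypothetical, chosen only to
make the example concrete; the text gives N = 1,024 but not Q.

```python
def smallest_partition(n_processors, q_controllers):
    """Processors per smallest partition when each MC controls N/Q of them."""
    if q_controllers < 1 or n_processors % q_controllers != 0:
        raise ValueError("Q must be positive and divide N evenly")
    return n_processors // q_controllers

example = smallest_partition(1024, 32)   # Q = 32 is a hypothetical value
```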

The memory storage system provides secondary storage
for data files in SIMD mode and for data and program
files in MIMD mode. The memory management
system  controls  the  transferring  of  files  between  the 
memory  storage  system  and  the  PCU  memory  mod¬ 
ules.  The  system  control  unit  is  a  conventional  com¬ 
puter  that  coordinates  different  components  of  PASM 
and  that  is  also  used  for  program  development  and  job 
scheduling. 


3.14  PICAP 

The  PICAP  is  a  MIMD  parallel  image  processing  sys¬ 
tem  developed  at  the  Picture  Processing  Laboratory, 
Linköping University, Sweden. The computer has a
modular  architecture  designed  to  consist  of  up  to  16  dif¬ 
ferent  processors.  Each  processor  runs  independently 
and  has  a  specialized  function,  such  as  linear  filtering, 
segmentation,  and  image  I/O.  It  is  a  multiuser  system 
that  allows  users  to  share  processors  and  image  mem¬ 
ory  dynamically. 

One  of  the  processors  is  a  SIMD  filter  processor,  FIP, 
that  consists  of  four  subprocessors  operating  in  parallel 
according  to  a  common  microprogram.  Each  subpro¬ 
cessor  is  pipelined  to  increase  its  performance.  A  32- 
Kbyte  local  memory  in  the  FIP  is  partitioned  into  four 
separate  modules  thus  allowing  simultaneous  retrieval 
of  data  by  the  four  subprocessors.  For  neighborhood 
operations,  it  stores  a  horizontal  strip  of  an  image  such 
that  pixels  are  both  horizontally  and  vertically  adjacent 
to each other. Its peak performance is 10^8 elementary
operations per second on 8-bit data.

All  processors  operate  on  images  stored  in  a  large 
random-access  image  memory  of  four  Mbytes,  inter¬ 
leaved  over  16  separate  memory  modules.  A  40- 
Mbyte/second  time-shared-bus  connects  all  the  proces¬ 
sors  to  the  memory. 
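
Interleaving spreads consecutive addresses over the memory modules so that
several accesses can proceed at once. The sketch below assumes simple
low-order interleaving at byte granularity, which the text does not
specify; the function names are illustrative.

```python
MODULES = 16  # number of image-memory modules given in the text

def module_for(address):
    """Module that holds `address` under low-order interleaving."""
    if address < 0:
        raise ValueError("address must be non-negative")
    return address % MODULES

def conflict_free(addresses):
    """True if no two of the addresses fall in the same module."""
    modules = [module_for(a) for a in addresses]
    return len(set(modules)) == len(modules)
```

Under this scheme any 16 consecutive addresses land in 16 different
modules, whereas addresses 16 apart collide in the same module.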

The computer is programmed in three ways. One is
a menu-based command language in which commands
are executed directly, although it is possible to store a
sequence of commands in a file. A high-level, highly
interactive programming language, PPL, designed for
the structure of PICAP, is also available. Each procedure of
PPL can be executed separately. PICAP also handles
programs written in FORTRAN or Pascal through a
set of library routines.

3.15  RP3 

The Research Parallel Processor Project (RP3) was initiated
at IBM. RP3 is designed as a parallel MIMD computer
for investigating both hardware and software aspects
of parallel computation. It can be configured
to support both the shared-memory and the local-memory
message-passing paradigms, as well as mixtures of the two.

RP3  is  designed  to  have  up  to  512  processor/memory 
elements  (PME)  with  an  interconnection  network.  A 
full  system  is  expected  to  provide  up  to  1.3  GIPS, 
800 MFLOPS, 1-2 Gbytes of main storage, a 192-
Mbyte/second I/O rate, and 13-Gbyte/second inter-
processor communication capability.

Each  PME  contains  a  32-bit  processor,  4  Mbytes  of 
memory,  a  32-Kbyte  cache,  a  floating-point  unit,  and 
an  interface  to  the  I/O  and  Support  Processor  (ISP). 


The  RP3  processor  is  a  proprietary  design  based  on  the 
philosophy  that  all  instructions  should  be  completed  in 
a  single  cycle.  But  unlike  other  RISC  architectures,  it 
has  an  extensive  instruction  set  and  performs  necessary 
interlocks  internally  in  hardware.  The  PME  provides  a 
memory  mapping  function  as  part  of  address  transla¬ 
tion  and  allows  memory  to  be  dynamically  partitioned 
between  global  and  local  memory. 

The  RP3  interconnection  network  is  composed  of  two 
networks.  One  provides  low  latency,  and  the  other  has 
the  ability  to  combine  messages  directed  to  the  same 
memory  location.  The  low  latency  network  is  similar  to 
an  Omega  network  but  provides  dual  source-sink  paths. 

The  ISP  supports  I/O,  monitors  performance,  and 
mediates  system  initialization  and  configuration.  It  is 
an  independently  programmable  processor  containing 
4  Mbytes  of  memory  and  the  same  processor  used  in 
PMEs. 

The  operating  system  for  RP3  will  be  based  on  BSD 
4.2  UNIX.  C,  FORTRAN,  and  possibly  Pascal  will  be 
available  as  high-level  programming  languages. 

3.16  Warp 

Warp is a MIMD one-dimensional systolic array
computer being designed and built at Carnegie Mellon
University. It consists of ten identical cells in a linear array
and delivers a peak performance of 100 MFLOPS.

Each  cell  contains  two  floating  point  processors,  one 
ALU,  and  one  multiplier.  The  processors  are  pipelined, 
and  each  can  deliver  up  to  5  MFLOPS. 
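
The peak figure quoted for the whole array follows from the per-processor rate, assuming that both floating-point processors in every cell run at their peak rate at the same time:

```latex
10 \text{ cells} \times 2 \text{ processors per cell} \times 5 \text{ MFLOPS} = 100 \text{ MFLOPS}
```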

An operand register file is dedicated to each arithmetic
processing unit to ensure that data can be supplied
at the rate they are consumed. Within each cell, a
crossbar is used to support a high intra-cell bandwidth.

Every cell contains 4K of 152-bit word micro-store
and 4K of 32-bit RAM as well as other registers to
provide sufficient control. Each cell can be programmed
individually to execute a different computation.

Data  flow  through  the  array  on  two  data  paths, 
while  addresses  and  control  signals  travel  on  the  ad¬ 
dress  path.  Each  input  data  path  has  a  queue  to  buffer 
input  data. 

An interface unit handles I/O between the array and
the host and performs data conversion. Addresses for
data and control signals are generated by the interface
unit and are propagated from cell to cell.

Warp  is  designed  to  interface  with  a  VAX  11/780 
through  an  interface  computer  which  provides  1  Mbyte 
of  memory  and  a  24-Mbyte/second  bandwidth.  The 
host  is  responsible  for  carrying  out  high-level  applica¬ 
tion  routines  and  supplying  data  to  the  Warp. 




4  Further  Information 


Balance : Sequent Computer Systems Inc., 15450
S.W. Koll Parkway, Beaverton, OR 97006-6063.
Telephone : 503-626-5700

Butterfly : BBN Advanced Computer Inc., 10
Fawcett Street, Cambridge, MA 02238. Telephone :
617-497-3700

CAPP : Department of Computer and Information
Science, University of Massachusetts, Amherst, MA
01003.

CLIP : Image Processing Group, Department of
Physics and Astronomy, University College London,
WC1E 6BT, United Kingdom.

Connection Machine : Thinking Machines Corporation,
245 First Street, Cambridge, MA 02142-1214.
Telephone : 617-876-1111

FAIM-1 : Artificial Intelligence Laboratory,
Schlumberger Palo Alto Research, 3310 Hillview Avenue,
Palo Alto, CA 94304.

FX/Series : Alliant Computer Systems Corporation,
42 Nagog Park, Acton, MA 01720. Telephone :
617-263-9110

GAPP : Microelectronics Division, NCR Corp., Fort
Collins, CO.

iPSC : Intel Scientific Computers, 15201 N.W.
Greenbrier Parkway, Beaverton, OR 97006. Telephone
: 503-629-7600

MPP : Digital Technology Department, Goodyear
Aerospace Corporation, Akron, OH 44315.

Multimax : Encore Computer Corporation, 257
Cedar Hill Street, Marlborough, MA 01752. Telephone
: 617-460-0500

NON-VON : Department of Computer Science,
Columbia University, New York, NY 10027.

PASM : School of Electrical Engineering, Purdue
University, West Lafayette, IN 47907.

PICAP : Department of Electrical Engineering,
Linkoping University, S-581 83 Linkoping,
Sweden.

RP3 : IBM T. J. Watson Research Center,
Yorktown Heights, NY 10598.

Warp : Computer Science Department, Carnegie-
Mellon University, Pittsburgh, PA 15213.


5  Appendix 


Table of comparison

Computer     Total number    Total          Total         Individual     Individual
             of processors   processors'    memory        processor's    processor's
                             power          size          power          memory

Balance      30              21 MIPS        48 Mbytes     0.7 MIPS       8 Kbytes
Butterfly    256             250 MIPS       1026 Mbytes   1 MIPS         4 Mbytes
CAPP         262,144         100 ns/cycle   1 Mbyte       100 ns/cycle   32 bits
CLIP7        2,048           100 ns/cycle   65 Kbytes     100 ns/cycle   256 bits
Connection   65,536          1,000 MIPS     32 Mbytes     0.016 MIPS     4,096 bits
FAIM-1       19              N/A            N/A           N/A            N/A
FX/8         8               94 MFLOPS      64 Mbytes     11.8 MFLOPS    8 Mbytes
GAPP         2304            921 MIPS       36 Kbytes     0.4 MIPS       128 bits
iPSC         128             1,280 MFLOPS   96 Mbytes     10 MFLOPS      4.5 Mbytes
MPP          16,384          400 KFLOPS     128 Mbytes    25 KFLOPS      65 Kbytes
Multimax     20              15 MIPS        128 Mbytes    0.7 MIPS       32 Kbytes
NON-VON      N/A             N/A            N/A           N/A            N/A
PASM         1024            N/A            N/A           N/A            N/A
PICAP        16              100 MIPS       4 Mbytes      100 MIPS       32 Kbytes
RP3          512             1.6 MFLOPS     2 Gbytes      800 MFLOPS     4 Kbytes
Warp         10              100 MFLOPS     640 Kbytes    10 MFLOPS      64 Kbytes

Table of comparison (continued)

Computer     Archi-    Market   Running   Chip      Operating    Programming
             tecture   avail.   version   used      system       language

Balance      MIMD      Yes      Yes       NS32032   unix-based   C, FORTRAN, Pascal, Ada
Butterfly    MIMD      Yes      Yes       MC68020   Chrysalis    C, FORTRAN, Pascal, Lisp
CAPP         SIMD      No       No        custom    host-based   N/A
CLIP7        SIMD      No       Yes       custom    host-based   C, IPC
Connection   MSIMD     Yes      Yes       custom    host-based   C, Lisp
FAIM-1       MIMD      No       No        N/A       N/A          N/A
FX/8         MIMD      Yes      Yes       custom    unix-based   C, FORTRAN, Pascal, assembler
GAPP         SIMD      No       Yes       custom    host-based   C, assembler
iPSC         MIMD      Yes      Yes       In80286   unix-based   C, FORTRAN, Lisp, assembler
MPP          SIMD      No       Yes       custom    host-based   N/A
Multimax     MIMD      Yes      Yes       NS32032   unix-based   C, FORTRAN, Pascal, Ada, Lisp
NON-VON      MIMD      No       No        N/A       N/A          N/A
PASM         MIMD      No       No        N/A       N/A          N/A
PICAP        MSIMD     No       Yes       custom    host-based   FORTRAN, Pascal, assembler
RP3          MIMD      No       No        custom    unix-based   C, FORTRAN, Pascal
Warp         MIMD      No       Yes       custom    host-based   N/A


Evidence  Combination  Using  Likelihood  Generators 


David  Sher 

Computer  Science  Department 
The  University  of  Rochester 
Rochester,  New  York  14627 


Abstract 

Here, I address the problem of combining the output of several
detectors for the same feature of an image. I show that if the
detectors return likelihoods I can robustly combine their
outputs. The combination has the advantages that:

•  The  confidences  of  the  operators  in  their  own  reports  are 
taken  into  account.  Hence  if  an  operator  is  confident  about  the 
situation  and  the  others  are  not,  then  the  reports  of  the 
confident  operator  dominate  the  decision  process. 

•  A  priori  confidences  in  the  different  operators  can  be  taken 
into  account. 

•  The work to combine N operators is linear in N.

This  theory  has  been  applied  to  the  problem  of  boundary 
detection. Results from these tests are presented here.

1.  Introduction 

Often  in  computer  vision  one  has  a  task  to  do  such  as 
deriving  the  boundaries  of  objects  in  an  image  or  deriving 
the  surface  orientation  of  objects  in  an  image.  Often  one 
also  has  a  variety  of  techniques  to  do  this  task.  For 
boundary detection there are a variety of techniques from the
classical edge detection literature [Ballard82] and the image
segmentation literature, e.g. [Ohlander79]. For determining
surface orientation there are techniques that derive surface
orientation from intensities [Horn70] and texture [Ikeuchi80]
[Aloimonos85]. These techniques make certain assumptions
about the structure of the scene that produced the data.
Such  techniques  are  only  reliable  when  their  assumptions 
are  met.  Here  I  show  that  if  several  algorithms  return 
likelihoods  I  can  derive  from  them  the  correct  likelihood 
assuming  at  least  one  of  the  algorithms’  assumptions  are 
met.  Thus  I  derive  an  algorithm  that  works  well  when  any 
of  the  individual  algorithms  works  well. 

The mathematics here were derived independently but
are similar to the treatment in [Good50] and [Good83],
using different notation. To understand my results, one
must first understand the meaning of likelihood.

2.  Likelihoods 

In  this  paper  I  call  the  assumptions  that  an  algorithm 
makes  about  the  world  a  model.  Most  models  for  computer 
vision  problems  describe  how  configurations  in  the  real  world 
generate  observed  data.  Because  imaging  projects  away 
information,  the  models  do  not  explicitly  state  how  to  derive 


the  configuration  of  the  real  world  from  the  sensor  data.  As  a 
result,  graphics  problems  are  considerably  easier  than  vision 
problems.  Programs  can  generate  realistic  images  that  no 
program  can  analyze. 

Let $O$ be the observed data, $f$ a feature of the scene
whose existence we are trying to determine (like a boundary
between two pixels), and $M$ a model. Many computer vision
problems can be reduced to finding the probability of the
feature given the model and the data, $P(f \mid O \,\&\, M)$. However,
most models for computer vision instead make it easy to
compute $P(O \mid f \,\&\, M)$. I call $P(O \mid f \,\&\, M)$ (inspired by the
statistical literature) the likelihood of $f$ given observed data
$O$ under $M$. As an example, assume $f$ is "the image has a
constant intensity before noise". $M$ says that the image has
a normally distributed uncorrelated (between pixels) number
added to each pixel (the noise). Calculating $P(O \mid f \,\&\, M)$ is
straightforward (a function of the mean and variance of $O$).

A  theorem  of  probability  theory,  Bayes’  law,  shows  how 
to  derive  conditional  probabilities  for  features  from 
likelihoods  and  prior  probabilities.  Bayes’  law  is  shown  in 
equation  1. 

$$P(f \mid O \,\&\, M) = \frac{P(O \mid f \,\&\, M)\, P(f \mid M)}{P(O \mid M)} \tag{1}$$

$f$ is the feature for which we have likelihoods. $M$ is the
domain model we are using. $P(O \mid f \,\&\, M)$ is the likelihood of $f$
under $M$ and $P(f \mid M)$ is the probability under $M$ of $f$.
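
As a minimal sketch of how Bayes' law turns likelihoods into probabilities, the Python function below normalizes likelihood-times-prior over a set of labels. The name `posterior`, the dictionary layout, and the numbers in the usage lines are illustrative assumptions rather than part of the original work. The sketch also assumes the labels are mutually exclusive and exhaustive, so that $P(O \mid M)$ can be obtained by summing over them.

```python
def posterior(likelihoods, priors):
    """Bayes' law over mutually exclusive, exhaustive labels.

    likelihoods[l] stands for P(O | f=l & M) and priors[l] for P(f=l | M).
    Returns a dict mapping each label l to P(f=l | O & M).
    """
    joint = {label: likelihoods[label] * priors[label] for label in likelihoods}
    evidence = sum(joint.values())  # P(O | M) by the law of total probability
    if evidence == 0:
        raise ValueError("the observed data has zero probability under the model")
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical numbers: a boundary is a priori unlikely, but the data favour it.
result = posterior({"boundary": 0.02, "no boundary": 0.005},
                   {"boundary": 0.1, "no boundary": 0.9})
```

Working with a whole set of labels at once keeps the normalizing term explicit, which is the only part of Bayes' law that a single likelihood generator cannot supply on its own.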

Another  important  use  for  explicit  likelihoods  is  for  use 
in  Markov  random  fields.  Markov  random  fields  describe 
complex  priors  that  can  capture  important  information. 
Several people have applied Markov random fields to vision
problems [Geman84]. Likelihoods can be used in a Markov
random field formulation to derive estimates of boundary
positions [Marroquin85a] [Chou87]. In [Sher86] and
[Sher87] I discuss algorithms for determining likelihoods of
boundaries.

Let us call an algorithm that generates likelihoods a
likelihood generator. Consider likelihood generators $L_1$ and
$L_2$ with models $M_1$ and $M_2$, and assume they both determine
probability distributions for the same feature. $L_1$ can be
considered to return the likelihood of a label $l$ for feature $f$
given observed data $O$ and the domain model $M_1$. Thus $L_1$
calculates $P(O \mid f{=}l \,\&\, M_1)$. Likewise $L_2$ calculates
$P(O \mid f{=}l \,\&\, M_2)$. A useful combination of $L_1$ and $L_2$ is the
likelihood detector that returns the likelihoods for the case
where $M_1$ or $M_2$ is true. Also, the prior confidences one has
in $M_1$ and $M_2$ should be taken into account.




This paper studies deriving $P(O \mid f{=}l \,\&\, (M_1 \vee M_2))$. Note
that if I can derive rules for combining likelihoods for two
different models, then by applying the combination rules
repeatedly, N likelihoods are combined. Thus all that is needed
is combination rules for two models.

3.  Combining  Likelihoods  From  Different 
Models 

To combine likelihoods derived under $M_1$ and $M_2$, an
examination of the structure and interaction of the two
models is necessary. $M_1$ and $M_2$ must have the same
definition for the feature being detected. If the feature is
defined differently for $M_1$ and $M_2$, then $M_1$ and $M_2$ are about
different events, and the likelihoods cannot be combined
with the techniques developed in this section.

Thus the likelihood generated by an occlusion boundary
detector cannot be combined with the likelihood generated
by a detector for boundaries within the image of an object
(such as corners internal to the image). A detector of the
likelihood of heads on a coin flip cannot be combined with a
detector of the likelihood of rain outside using this theory
(however easy it may be using standard probability theory).


3.1.  Combining  Two  Likelihoods 

The formula for combining the likelihoods generated
under $M_1$ and $M_2$ requires prior knowledge. Necessary are
the prior probabilities $P(M_1)$ and $P(M_2)$ that the domain
models $M_1$ and $M_2$ are correct, as well as $P(M_1 \,\&\, M_2)$. Often
$P(M_1 \,\&\, M_2) = 0$. When this occurs the two models contradict
each other. I call two such models disjoint because both
cannot describe the situation simultaneously. If $M_1$ is a model
with noise of standard deviation $4 \pm \epsilon$ and $M_2$ is a model
with noise of standard deviation $8 \pm \epsilon$, then their assumptions
contradict and $P(M_1 \,\&\, M_2) = 0$.

Prior probabilities for the feature labels under each
model ($P(f{=}l \mid M_1)$ and $P(f{=}l \mid M_2)$) are necessary. If
$P(M_1 \,\&\, M_2) \neq 0$ then the prior probability of the feature label
under the conjunction of $M_1$ and $M_2$ ($P(f{=}l \mid M_1 \,\&\, M_2)$) and
the output of a likelihood generator for the conjunction of the
two models ($P(O \mid f{=}l \,\&\, M_1 \,\&\, M_2)$) are needed. If I have this
prior information I can derive $P(O \mid f{=}l \,\&\, (M_1 \vee M_2))$.

If I were to combine another model, $M_3$, with this
combination I need the priors $P(M_3)$, $P(f \mid M_3)$,
$P(M_3 \,\&\, (M_1 \vee M_2))$ and $P(f \mid M_3 \,\&\, (M_1 \vee M_2))$. To add on another
model I need another 4 priors. Thus the number of prior
probabilities to combine $n$ models is linear in $n$.

Thus  all  that  is  left  is  to  derive  the  combination  rule 
for  likelihood  generators  given  this  prior  information.  The 
derivation  starts  by  applying  the  definition  of  conditional 
probability  in  equation  2. 


$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{P(O \,\&\, f{=}l \,\&\, (M_1 \vee M_2))}{P(f{=}l \,\&\, (M_1 \vee M_2))} \tag{2}$$


The  formula  for  probability  of  a  disjunction  is  applied  to  the 
numerator  and  denominator  in  equation  3. 


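$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{P(O \,\&\, f{=}l \,\&\, M_1) + P(O \,\&\, f{=}l \,\&\, M_2) - P(O \,\&\, f{=}l \,\&\, M_1 \,\&\, M_2)}{P(f{=}l \,\&\, M_1) + P(f{=}l \,\&\, M_2) - P(f{=}l \,\&\, M_1 \,\&\, M_2)} \tag{3}$$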


In  equation  4  the  definition  of  conditional  probability  is 
applied  again  to  the  terms  of  the  numerator  and  the 
denominator. 

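$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{\begin{aligned} & P(O \mid f{=}l \,\&\, M_1)\,P(f{=}l \mid M_1)\,P(M_1) + P(O \mid f{=}l \,\&\, M_2)\,P(f{=}l \mid M_2)\,P(M_2) \\ & \quad - P(O \mid f{=}l \,\&\, M_1 \,\&\, M_2)\,P(f{=}l \mid M_1 \,\&\, M_2)\,P(M_1 \,\&\, M_2) \end{aligned}}{P(f{=}l \mid M_1)\,P(M_1) + P(f{=}l \mid M_2)\,P(M_2) - P(f{=}l \mid M_1 \,\&\, M_2)\,P(M_1 \,\&\, M_2)} \tag{4}$$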


Different  assumptions  allow  different  simplifications  to 
be  applied  to  the  rule  in  equation  4.  If  the  two  models  are 
disjoint  equation  4  reduces  to  equation  5. 

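$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{P(O \mid f{=}l \,\&\, M_1)\,P(f{=}l \mid M_1)\,P(M_1) + P(O \mid f{=}l \,\&\, M_2)\,P(f{=}l \mid M_2)\,P(M_2)}{P(f{=}l \mid M_1)\,P(M_1) + P(f{=}l \mid M_2)\,P(M_2)} \tag{5}$$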


Another assumption that simplifies things considerably is
the assumption that prior probabilities for all feature
labelings in all the models and combinations thereof are the
same. I call this assumption constancy of priors. When
constancy of priors is assumed,
$P(f{=}l \mid M_1) = P(f{=}l \mid M_2) = P(f{=}l \mid M_1 \,\&\, M_2)$. Making this
assumption reduces the number of priors that need to be
determined. Since determining prior probabilities from a
model is sometimes a difficult task, the constancy of priors is
a useful simplification. With constancy of priors equation 4
reduces to equation 6.

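$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{P(O \mid f{=}l \,\&\, M_1)\,P(M_1) + P(O \mid f{=}l \,\&\, M_2)\,P(M_2) - P(O \mid f{=}l \,\&\, M_1 \,\&\, M_2)\,P(M_1 \,\&\, M_2)}{P(M_1) + P(M_2) - P(M_1 \,\&\, M_2)} \tag{6}$$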


Equation  5  with  constancy  of  priors  reduces  to  equation  7. 


$$P(O \mid f{=}l \,\&\, (M_1 \vee M_2)) = \frac{P(O \mid f{=}l \,\&\, M_1)\,P(M_1) + P(O \mid f{=}l \,\&\, M_2)\,P(M_2)}{P(M_1) + P(M_2)} \tag{7}$$


Thus equation 7 describes the likelihood combination rule
with disjoint models and constancy of priors.
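
The two-model rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the function name `combine_likelihoods` and its parameter names are hypothetical. The function evaluates the general rule (equation 4). Its default arguments encode disjoint models with constant priors, in which case the common prior cancels and the result is the weighted average of equation 7.

```python
def combine_likelihoods(lik1, lik2, p_m1, p_m2,
                        prior1=1.0, prior2=1.0,
                        lik12=0.0, p_m12=0.0, prior12=1.0):
    """Likelihood of the data given f=l and (M1 or M2), following equation 4.

    lik1, lik2     : P(O | f=l & M1), P(O | f=l & M2)
    p_m1, p_m2     : P(M1), P(M2)
    prior1, prior2 : P(f=l | M1), P(f=l | M2)
    lik12          : P(O | f=l & M1 & M2)
    p_m12          : P(M1 & M2), zero for disjoint models
    prior12        : P(f=l | M1 & M2)
    """
    numerator = (lik1 * prior1 * p_m1 + lik2 * prior2 * p_m2
                 - lik12 * prior12 * p_m12)
    denominator = prior1 * p_m1 + prior2 * p_m2 - prior12 * p_m12
    if denominator <= 0:
        raise ValueError("f=l has zero probability under (M1 or M2)")
    return numerator / denominator


# Disjoint models, constant priors, equal model probabilities (equation 7).
average = combine_likelihoods(0.2, 0.6, 0.5, 0.5)
```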

3.2.  Understanding  the  Likelihood 
Combination  Rule 

The easiest incarnation of the likelihood combination
rule  to  understand  is  the  rule  for  combining  likelihoods  from 
disjoint  models  given  constancy  of  priors  across  models 
(equation  7).  Here  the  combined  likelihood  is  the  weighted 
average  of  the  likelihoods  from  the  individual  models 
weighted  by  the  probabilities  of  the  models  applying.  (The 
combined  likelihood  is  the  likelihood  given  the  disjunction  of 
the  models). 
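
For more than two disjoint models with constant priors, the weighted average extends directly. The sketch below (with hypothetical function names) computes it in one pass and also by folding the two-model rule over the list. The fold relies on the prior of a disjunction of disjoint models being the sum of their priors, and it illustrates why the work is linear in the number of models.

```python
from functools import reduce


def combine_disjoint(likelihoods, model_priors):
    """Prior-weighted average of the likelihoods from N disjoint models."""
    total = sum(model_priors)
    return sum(l * p for l, p in zip(likelihoods, model_priors)) / total


def combine_pairwise(likelihoods, model_priors):
    """Same value, obtained by applying the two-model rule repeatedly."""
    def step(acc, nxt):
        (lik_a, p_a), (lik_b, p_b) = acc, nxt
        # For disjoint models P(A or B) = P(A) + P(B).
        return ((lik_a * p_a + lik_b * p_b) / (p_a + p_b), p_a + p_b)
    return reduce(step, zip(likelihoods, model_priors))[0]


# Hypothetical likelihoods from four detectors with equal model priors.
direct = combine_disjoint([0.1, 0.4, 0.3, 0.2], [0.25] * 4)
folded = combine_pairwise([0.1, 0.4, 0.3, 0.2], [0.25] * 4)
```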

If models $M_1$ and $M_2$ are considered equally probable
and the likelihoods returned by $M_1$'s detector are
considerably larger than those of $M_2$'s detector, then the
probabilities determined from the combination of $M_1$ and $M_2$
are close to those determined from $M_1$. Thus a model with
large likelihoods determines the probabilities. To illustrate
this principle consider an example.

Assume that a coin has been flipped $n+1$ times. The
results of the first $n$ flips have been reported.
The task is to determine the probability of heads having
been the result of the $(n+1)$st flip. Consider the results of
each coin flip independent. Let $M_1$ be the coin being fair so
that the probability of heads and tails is equal. Let $M_2$ be
that the coin is biased, with the probability of heads being $\pi$ and
tails $1-\pi$, where $\pi$ is a random choice with equal
probability between $p$ and $1-p$. Hence the coin is biased
towards heads or tails with equal probability but the bias is
consistent between coin tosses. The probability of heads
remains the same for all coin tosses in both models. $M_1$ and
$M_2$ are disjoint (the coin is either fair or it isn't but not both)
and the prior probability of a flip being heads or tails is the
same for both, .5.

Under $M_1$ the probability of each of the possible outcomes of
$n+1$ flips is $2^{-(n+1)}$. Under $M_2$ the probability of $n+1$ flips
with $h$ heads and $t = n+1-h$ tails is:

$$\tfrac{1}{2}\left(p^{h}(1-p)^{t} + p^{t}(1-p)^{h}\right)$$

Let $n = 2$ and $p = .9$. Assume the first two flips are both
heads.  Let  H  be  "the  third  flip  was  heads”  and  T  be  "the 
third  flip  was  tails."  The  likelihood  of  H  given  the  observed 
data  is  the  probability  of  all  3  flips  being  heads  divided  by 
the  probability  of  the  third  flip  being  heads.  The  likelihood 
of  T  given  the  observed  data  is  the  probability  of  the  first  2 
being  heads  and  the  3rd  tails  divided  by  the  probability  of 
the  third  flip  being  tails. 

Under $M_1$ the probability of all 3 flips being heads is
0.125 and the probability of a flip being heads is 0.5, thus the
likelihood of H is 0.25. The likelihood of T is 0.25 by the
same reasoning. Applying Bayes' law to get the probability
of H under $M_1$ one derives a probability of .5.

Under $M_2$ the probability of all 3 flips being heads is
0.365 and the probability of a flip being heads is 0.5. Thus
the likelihood of H is 0.73. Under $M_2$ the probability of the
first two being heads and the third being tails is 0.045 and
the probability of a flip being tails is 0.5. Thus the
likelihood of T is 0.09. Applying Bayes' law under $M_2$, a
probability of H being 0.89 is derived.

If $M_1$ and $M_2$ are considered equally probable then the
combination of the likelihoods from the two models is the
average of the two likelihoods. Thus the likelihood of H for
this combination is 0.49 and the likelihood of T is 0.17
(likelihoods don't have to sum to 1). Bayes' law combines
these likelihoods to get 0.74 for the third flip to be heads.
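
The arithmetic of this example can be laid out step by step. In the block below, $O$ stands for the two reported heads, and $H$ and $T$ for the two possible outcomes of the third flip; this shorthand is introduced here only for the worked calculation.

```latex
\begin{align*}
P(O \,\&\, H \mid M_2) &= \tfrac{1}{2}\left(0.9^3 + 0.1^3\right) = 0.365 \\
P(O \mid H \,\&\, M_2) &= 0.365 / 0.5 = 0.73 \\
P(O \,\&\, T \mid M_2) &= \tfrac{1}{2}\left(0.9^2 \cdot 0.1 + 0.1^2 \cdot 0.9\right) = 0.045 \\
P(O \mid T \,\&\, M_2) &= 0.045 / 0.5 = 0.09 \\
P(O \mid H \,\&\, (M_1 \vee M_2)) &= \tfrac{1}{2}\left(0.25 + 0.73\right) = 0.49 \\
P(O \mid T \,\&\, (M_1 \vee M_2)) &= \tfrac{1}{2}\left(0.25 + 0.09\right) = 0.17 \\
P(H \mid O \,\&\, (M_1 \vee M_2)) &= \frac{0.49 \cdot 0.5}{0.49 \cdot 0.5 + 0.17 \cdot 0.5} \approx 0.74
\end{align*}
```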

The table in figure 1 describes combining various $M_2$'s
with different values of $p$ with $M_1$, for the different
combinations of observed flips with $n = 4$.

Look at the probabilities with $p = .9$ when the observed
data is HHHH. For this case the observed data fits $M_2$ much
better than $M_1$ and the probability from combining $M_1$ and
$M_2$ is close to the probability resulting from using just $M_2$,
.9. If we had a longer run of heads the probability of future
heads would approach exactly $M_2$'s prediction, .9. On the
other hand if we had a long run of equal numbers of heads
and tails the probability of future heads would quickly
approach the prediction of $M_1$, .5. When the observed data is
HHHT the observed data fits $M_1$ about as well as $M_2$ and the
resulting probability is near the average of .5 predicted by
$M_1$ and 0.8902 predicted by $M_2$. Thus when the observed
data is a good fit for a particular model (like $M_2$) the
probabilities predicted by the combination are close to the
probabilities predicted by the fitted model. If two models fit
about equally then the result is an average of the
probabilities*.

4.  Results 


I have applied this evidence combination to the
boundary detection likelihood generators described in
[Sher87]. Here I support my claim that the evidence
combination theory allows me to take a set of algorithms
that are effective but not robust and derive an algorithm
that is robust. The output of such an algorithm is almost as
good as the best of its constituents (the algorithms that are
combined).

4.1.  Artificial  Images 

Artificial images were used to test the algorithms
described in section 3 quantitatively. I used as a source of
likelihoods the routines described in [Sher87]. Because the
positions of the boundaries in an artificial image are known,
one can accurately measure false positive and negative rates
for different operators. Also one can construct artificial
images to precise specifications. The artificial image I use
is composed of overlapping circles with constant
intensity and aliasing at the boundaries, shown in figure 2.


*However, the feature that the decision theory predicts is not the
average of the features predicted under the two different models in general.


Observed     Combined with M1   Likelihood of H      Likelihood of T      Probability of H
Coin Flips   or just M2         p = .6    p = .9     p = .6    p = .9     p = .6    p = .9

HHHH         Just M2            0.088     0.5905     0.0672    0.0657     0.567
             Combined           0.07525   0.3265     0.06485   0.0641     0.537

HHHT         Just M2            0.0672    0.0657     0.0576    0.0081     0.5385    0.8902
             Combined           0.06485   0.0641     0.06005   0.0353     0.5192    0.6449

HHTT         Just M2            0.0576    0.0081     0.0576    0.0081     0.5
             Combined           0.06005   0.0353     0.06005   0.0353     0.5

HTTT         Just M2            0.0576    0.0081     0.0672
             Combined           0.06005   0.0353     0.06485

TTTT         Just M2            0.0672    0.0657     0.088
             Combined           0.06485   0.0641     0.07525

Figure 1: Result of likelihood combination rule
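
The entries of figure 1 can be recomputed from the coin models. The following Python sketch uses hypothetical function names and assumes equal model priors and a prior of .5 for the next flip, as in the text. It prints every row of the table, including cells that are not legible in this copy.

```python
def sequence_prob_biased(heads, tails, p):
    """P(sequence | M2): heads probability is p or 1-p, each with chance 1/2."""
    return 0.5 * (p ** heads * (1 - p) ** tails + p ** tails * (1 - p) ** heads)


def table_row(observed, p, combined):
    """Return (likelihood of H, likelihood of T, probability of H) for the next flip."""
    heads, tails = observed.count("H"), observed.count("T")
    prior = 0.5  # prior probability of the next flip being heads (or tails)
    lik_h = sequence_prob_biased(heads + 1, tails, p) / prior
    lik_t = sequence_prob_biased(heads, tails + 1, p) / prior
    if combined:
        # Equal-weight average with the fair-coin model M1 (equation 7).
        fair = 0.5 ** (heads + tails + 1) / prior
        lik_h, lik_t = (lik_h + fair) / 2, (lik_t + fair) / 2
    return lik_h, lik_t, lik_h / (lik_h + lik_t)


for observed in ("HHHH", "HHHT", "HHTT", "HTTT", "TTTT"):
    for combined in (False, True):
        for p in (0.6, 0.9):
            lik_h, lik_t, prob_h = table_row(observed, p, combined)
            label = "Combined" if combined else "Just M2"
            print(f"{observed} {label:8s} p={p}: {lik_h:.5f} {lik_t:.5f} {prob_h:.4f}")
```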


Figure 2: Artificial test image

Figure 3: Stdev 4 detector applied to an image with stdev 12 noise
(a: image with stdev 12 noise; b: output of stdev 4 detector)

Figure 4: Stdev 12 detector applied to an image with stdev 12 noise
(a: image with stdev 12 noise; b: output of stdev 12 detector)

Figure 5: Combined detector applied to an image with stdev 12 noise
(a: image with stdev 12 noise; b: output of combined detector)


The intensities of the circles were selected from a uniform
distribution from 0 to 254. To the circles was added
normally distributed uncorrelated noise with standard
deviations 4, 8, 12, 16, 20, and 32. The software to generate
images of this form was built by Myra Van Inwegen working
under my direction. This software will be described in an
upcoming technical report.
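
A rough idea of such a test image can be given in a few lines of Python. This sketch is not the software mentioned above. The function name, the image size, the number of circles, the radius range, and the seed are all hypothetical, and the aliasing at circle boundaries is omitted for brevity.

```python
import random


def make_test_image(width, height, n_circles, noise_sd, seed=0):
    """Overlapping constant-intensity circles plus uncorrelated Gaussian noise."""
    rng = random.Random(seed)
    background = rng.uniform(0, 254)
    image = [[background] * width for _ in range(height)]
    for _ in range(n_circles):
        cx, cy = rng.uniform(0, width), rng.uniform(0, height)
        radius = rng.uniform(2, min(width, height) / 2)
        intensity = rng.uniform(0, 254)  # uniform intensities, as in the text
        for y in range(height):
            for x in range(width):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    image[y][x] = intensity  # later circles occlude earlier ones
    # Independent, normally distributed noise is added to every pixel.
    return [[value + rng.gauss(0, noise_sd) for value in row] for row in image]


noisy = make_test_image(32, 24, n_circles=5, noise_sd=12)
```

Fixing the seed keeps the circle layout reproducible, so the same scene can be regenerated at each noise level.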

In figure 3 I show the result of applying the detector
tuned to standard deviation 4 noise to the artificial image
with standard deviation 12 noise added to it. In figure 4 I
show the result of applying the detector tuned to standard
deviation 12 noise to an image with standard deviation 12
noise added to it. In figure 5 I show the result of applying
the combination of the detectors tuned to 4, 8, 12, and 16
standard deviation noise. The combination rule was that for
disjoint models with the same priors. The 4 models were
combined with equal probability. These operator outputs are
thresholded at 0.5 probability with black indicating an edge
and white indicating no edge.
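
The combination and thresholding step can be sketched as follows. The data layout (one list of per-pixel likelihoods per detector), the function name, and the `edge_prior` parameter are assumptions made for illustration; the actual likelihood generators are those of [Sher87] and are not reproduced here.

```python
def combined_edge_map(edge_liks, no_edge_liks, model_priors,
                      edge_prior, threshold=0.5):
    """Combine per-pixel likelihoods from disjoint models, then threshold.

    edge_liks[k][i]    : P(O | edge at pixel i & M_k)
    no_edge_liks[k][i] : P(O | no edge at pixel i & M_k)
    model_priors[k]    : P(M_k)
    edge_prior         : prior probability of an edge, assumed equal in all models
    Returns a list of booleans, True where P(edge | O) exceeds the threshold.
    """
    total = sum(model_priors)
    result = []
    for i in range(len(edge_liks[0])):
        # Weighted average of the likelihoods across models.
        lik_edge = sum(m[i] * w for m, w in zip(edge_liks, model_priors)) / total
        lik_none = sum(m[i] * w for m, w in zip(no_edge_liks, model_priors)) / total
        # Bayes' law over the two labels "edge" and "no edge".
        joint_edge = lik_edge * edge_prior
        joint_none = lik_none * (1.0 - edge_prior)
        evidence = joint_edge + joint_none
        result.append(evidence > 0 and joint_edge / evidence > threshold)
    return result


# Two hypothetical detectors, two pixels.
edges = combined_edge_map([[0.8, 0.01], [0.2, 0.02]],
                          [[0.1, 0.5], [0.3, 0.1]],
                          model_priors=[0.5, 0.5], edge_prior=0.5)
```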


Figure 6: Total errors by the stdev 4 detector

Figure 7: Total errors by the stdev 12 detector

Figure 8: Total errors by the tuned detector

Figure 9: Total errors by the combined detector


Note  that  the  result  of  using  the  combined  operator  is 
similar  to  that  of  the  operator  tuned  to  the  correct  noise 
level.  Most  of  the  false  boundaries  found  by  the  stdev  4 
operator  are  ignored  by  the  combined  operator. 

Using this artificial image I have acquired statistics
about the behavior of the combined detector vs the tuned
ones under varying levels of noise. I have charted the total
error rate for the artificial image in figure 2 with normally
distributed spatially independent noise with increasing
standard deviation. Figure 6 shows the error rate for a
detector tuned to standard deviation 4 noise as the noise in
the image increases. Figure 7 shows the error rate for a
standard deviation 12 operator. Figure 8 shows the error
rate for an operator tuned to the current standard deviation
of the noise. Figure 9 shows the error rate of the detector
that is the combination of the detectors tuned to a standard
deviation of 4, 8, 12 and 16 using the rule for combining
disjoint models with the same priors for the feature (since
the probability of a boundary is unaffected by the noise level
of the sensor). Figure 10 shows the superposition of the 4
previous graphs.
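
The text does not spell out how the total error rate is computed. One plausible reading, sketched here as an assumption, is the fraction of pixels that are either false positives or false negatives when the thresholded output is compared with the known boundary positions.

```python
def total_error_rate(detected, truth):
    """Fraction of pixels that are false positives or false negatives.

    detected and truth are equal-length sequences of booleans, one per pixel,
    with True meaning "boundary".
    """
    if len(detected) != len(truth):
        raise ValueError("maps must have the same number of pixels")
    false_pos = sum(1 for d, t in zip(detected, truth) if d and not t)
    false_neg = sum(1 for d, t in zip(detected, truth) if t and not d)
    return (false_pos + false_neg) / len(truth)


# One false positive and one false negative among four pixels.
rate = total_error_rate([True, False, True, False], [True, True, False, False])
```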


Figure 10: Total errors by all detectors
(square: stdev 4 operator; circle: stdev 12 operator;
triangle: tuned operator; cross: combined operator)


The  error  rate  for  the  combined  operator  is  always 
nearly  as  low  as  that  of  the  tuned  operator  (shown  as 
triangles).  It  is  lower  than  the  standard  deviation  12 
operator  when  the  noise  is  low  and  lower  than  the  standard 
deviation 4 operator when the noise is high. These results
are  evidence  that  my  combination  rule  is  robust. 

4.2.  Real  Images 

I  have  also  tested  these  theories  using  two  images 
taken  by  cameras.  One  of  these  images  is  a  tinker  toy  image 
taken  in  our  lab.  The  other  is  an  aerial  image  of  the  vicinity 
of  Lake  Ontario.  Figure  11  shows  the  result  of  the  operator 
tuned  to  standard  deviation  4  noise  applied  to  the  tinker  toy 
image  and  thresholded  at  0.5  probability.  Figure  12  shows 
the  result  of  the  operator  tuned  to  standard  deviation  12 
noise  applied  to  the  tinker  toy  image.  Figure  13  shows  the 
effect  of  combining  operators  tuned  to  standard  deviation  4, 
8, 12 and 16 with equal probability.


Here, the result of the combined operator seems to be a
cleaned up version of the standard deviation 4 operator.
Most of the features that are represented in the output of the
combined operator are, however, real features of the scene.
The line running horizontally across the image that the
standard deviation 4 operator and the combined operator
found is the place where the table meets the curtain behind
the tinkertoy. The standard deviation 4 operator was certain
of its interpretation and the other operators were uncertain
at that point, so its interpretation was used by the
combination.

The  results  from  the  aerial  image  are  also  instructive. 
Figure  14  shows  the  result  of  the  operator  tuned  to  standard 
deviation  4  noise  applied  to  the  aerial  image  and  thresholded 
at  0.5  probability.  Figure  15  shows  the  result  of  the  operator 
tuned  to  standard  deviation  12  noise  applied  to  the  aerial 
image.  Figure  16  shows  the  effect  of  combining  operators 
tuned  to  standard  deviation  4,  8,  12  and  16  with  equal 
probability. 


Figure 11: Stdev 4 detector applied to tinkertoy image
(a: tinkertoy image; b: output of stdev 4 detector)

Figure 12: Stdev 12 detector applied to tinkertoy image
(a: tinkertoy image; b: output of stdev 12 detector)

Figure 13: Combined detector applied to tinkertoy image
(a: tinkertoy image; b: output of combined detector)

Figure 14: Stdev 4 detector applied to aerial image
(a: aerial image; b: output of stdev 4 detector)

Figure 15: Stdev 12 detector applied to aerial image
(a: aerial image; b: output of stdev 12 detector)

Figure 16: Combined detector applied to aerial image
(a: aerial image; b: output of combined detector)

The results from the combined operator are again a
cleaned up version of the results from the standard deviation
4 operator. I believe this behavior occurs again because the
features being found by the standard deviation 4 operator
are in the scene. However, I do not have the intimate
knowledge of the aerial image that I do of the tinkertoy
image.

4.3.  Future  Experiments 

Soon, I will apply my evidence combination rules to
operators that make different assumptions about the
expected image histogram. The operator used so far in my
experiments expects a uniform histogram between 0 and 254.
Currently, a likelihood generator has been built that
assumes a triangular distribution with the probability of an
object having intensity less than 128 being one fourth the
probability of an object having intensity greater than or
equal to 128. It is not clear that the probabilities calculated
based on this assumption will be significantly different from
those based on the uniform histogram assumption. If there
is no difference in the output of two operators, the effect of
combination is invisible.

Larger  operators  will  soon  be  available.  The 
likelihoods  generated  based  on  these  larger  operators  would 
be  finely  tuned.  The  same  evidence  combination  can  be 
applied to these operators.

Likelihoods  are  used  by  Markov  random  field 
algorithms  to  determine  posterior  probabilities 
[Marroquin85a] [Chou87]. Likelihoods resulting from my
combination  rules  can  be  used  by  Markov  random  field 
algorithms. 

5.  Previous  Work 

Much of the work on evidence and evidence
combination in vision has been on high level vision. An
important Bayesian approach (and a motivation for my
work) was by Feldman and Yakimovsky [Feldman74]. In
this work Feldman and Yakimovsky were studying region
merging based on high level constraints. They first tried to
find a probability distribution over the labels of a region
using its characteristics such as mean color or texture. They
then tried to improve these distributions using labelings for
the neighbors. Then they made merge decisions based on
whether it was sufficiently probable that two adjacent
regions were the same.

Work  with  a  similar  flavor  has  been  done  by  Hanson 
and  Riseman.  In  [Hanson80]  Bayesian  theories  are  applied 
to  edge  relaxation.  This  work  had  serious  problems  with  its 
models  and  the  fact  that  the  initial  probabilities  input  were 
edge  strengths  normalized  never  to  exceed  1.  Of  course  such 
edge  strengths  have  little  relationship  to  probabilities  (a 
good  edge  detector  tries  to  be  monotonic  in  its  output  with 
probability  but  that  is  about  as  far  as  it  gets).  In 
[Wesley82a]  and  [Wesley82b]  Dempster-Shafer  evidence 
theory  is  used  to  model  and  understand  high  level  problems 
in  vision  especially  region  labeling.  In  [Wesley82b]  there  is 
some informed criticism of Bayesian approaches. In
[Reynolds85] the authors study how one converts low level feature
values into input for a Dempster-Shafer evidence system.

There has been much use of likelihoods in recent vision
work. In particular, work based on Markov random fields
[Geman84] [Marroquin85b] [Marroquin85a] uses likelihoods.
A Markov random field is a prior probability distribution for
some feature of an image and the likelihoods are used to
compute the marginal posterior probabilities that are used to
update the field. Haralick has mentioned that his facet
model [Haralick84] [Haralick86b] can be easily used to build
edge detectors that return likelihoods [Haralick86a]. I also
have built boundary detectors that return likelihoods and
the results of using them are documented in [Sher87]. Paul
Chou is using the likelihoods I produce with Markov random
fields for edge relaxation [Chou87]. He is also studying the
use of likelihoods for information fusion. Currently, he is
concentrating on information fusion from different sources of
information.

6.  Conclusion 

I  have  presented  a  Bayesian  technique  for  information 
fusion.  I  show  how  to  fuse  information  from  detectors  with 
different  models.  I  presented  results  from  applying  these 
techniques  to  artificial  and  real  images. 

These  techniques  take  several  operators  that  are  tuned 
to  work  well  when  the  scene  has  certain  particular 
properties  and  get  an  algorithm  that  works  almost  as  well  as 
the  best  of  the  operators  being  combined.  Since  most 
algorithms  available  for  machine  vision  are  erratic  when 
their  assumptions  are  violated  this  work  can  be  used  to 
improve  the  robustness  of  many  algorithms. 

7.  Acknowledgements 

This work would have been impossible without the
advice and argumentation of such people as Paul Chou and
Mike Swain (who has made suggestions from the beginning)
and of course my advisor Chris Brown, and who could forget
Jerry Feldman. This work was supported by the Defense
Advanced Research Projects Agency U. S. Army Engineering
Topographic Labs under grant number DACA76-85-C-0001
and the National Science Foundation under grant number
DCR-8320136. This work was also supported by the Air Force
Systems Command, Rome Air Development Center, Griffiss
Air Force Base, New York 13441-5700, and the Air Force
Office of Scientific Research, Bolling AFB, DC 20332, under
Contract No. F30602-85-C-0008. This contract supports the
Northeast Artificial Intelligence Consortium (NAIC).

References 


[Aloimonos85] 

J.  Aloimonos  and  P.  Chou,  Detection  of  Surface 
Orientation  and  Motion  from  Texture:  1.  The  Case 
of  Planes,  161,  Computer  Science  Department, 
University  of  Rochester,  January  1985. 

[Ballard82] 

D.  H.  Ballard  and  C.  M.  Brown,  in  Computer 
Vision,  Prentice-Hall  Inc.,  Englewood  Cliffs,  New 
Jersey,  1982,  125. 

[Chou87]  P.  Chou,  Multi-Modal  Segmentation  using  Markov 
Random  Fields,  Submitted  to  IJCAI,  January 
1987. 

[Feldman74] 

J. A. Feldman and Y. Yakimovsky, Decision
Theory and Artificial Intelligence: I. A
Semantics-Based Region Analyzer, Artificial
Intelligence 5 (1974), 349-371, North-Holland
Publishing Company.

[Geman84] 

S.  Geman  and  D.  Geman,  Stochastic  Relaxation, 
Gibbs  Distributions,  and  the  Bayesian  Restoration 
of  Images,  PAMI  6,6  (November  1984),  721-741, 
IEEE. 

[Good50]  I.  J.  Good,  Probability  and  the  Weighing  of 
Evidence, Hafner Publishing Company, London,
New York, 1950.

[Good83]  I.  J.  Good,  Subjective  Probability  as  the  Measure 
of  a  Non-measurable  Set,  in  Good  Thinking:  The 
Foundations  of  Probability  and  its  Applications, 
Minneapolis  (editor),  University  of  Minnesota 
Press,  Minneapolis,  1983,  73-82 

[Hanson80] 

A.  R.  Hanson,  E.  M.  Riseman  and  F.  C.  Glazer, 
Edge  Relaxation  and  Boundary  Continuity,  80-11, 
University  of  Massachusetts  at  Amherst, 
Computer  and  Information  Science,  May  1980. 

[Haralick84] 

R. M. Haralick, Digital Step Edges from Zero
Crossing of Second Directional Derivatives, PAMI
6,1 (January 1984), 58-68, IEEE.

[Haralick86a]

R. Haralick, Personal Communication, June 1986.

[Haralick86b]

R.  M.  Haralick,  The  Facet  Approach  to  Gradient 
Edge  Detection,  Tutorial  1  Facet  Model  Image 
Processing  (CVPR),  May  1986. 


[Horn70] B. K. P. Horn, Shape from Shading: A Method for
Finding the Shape of a Smooth Opaque Object
from One View, Massachusetts Institute of
Technology Department of Electrical
Engineering, August 1970.

[Ikeuchi80] 

K. Ikeuchi, Shape from Regular Patterns (an
Example  of  Constraint  Propagation  in  Vision), 
567,  Massachusetts  Institute  of  Technology, 
Artificial  Intelligence  Laboratory,  March  1980. 

[Marroquin85a] 

J.  L.  Marroquin,  Probabilistic  Solution  of  Inverse 
Problems,  Tech.  Rep.  860,  MIT  Artificial 
Intelligence Laboratory, September 1985.

[Marroquin85b]

J.  Marroquin,  S.  Mitter  and  T.  Poggio, 
Probabilistic  Solution  of  Ill-Posed  Problems  in 
Computational  Vision,  Proceedings:  Image 
Understanding  Workshop,  December  1985,  293- 
309.  Sponsored  by:  Information  Processing 
Techniques  Office  Defence  Advanced  Research 
Projects  Agency. 

[Ohlander79] 

R.  Ohlander,  K.  Price  and  D.  R.  Reddy,  Picture 
Segmentation  using  a  Recursive  Region  Splitting 
Method,  CGIP  8,3  (1979). 

[Reynolds85]

G. Reynolds, D. Strahman and N. Lehrer,
Converting Feature Values to Evidence,
Proceedings: Image Understanding Workshop,
December 1985, 331-339.
Sponsored by: Information Processing Techniques
Office, Defence Advanced Research Projects
Agency.

[Sher86]  D.  Sher,  Optimal  Likelihood  Detectors  for 
Boundary  Detection  Under  Gaussian  Additive 
Noise,  IEEE  Conference  on  Computer  Vision  and 
Pattern  Recognition,  Miami,  Florida,  June  1986. 

[Sher87] D. B. Sher, Advanced Likelihood Generators for
Boundary Detection, TR197, University of
Rochester Computer Science Department,
Rochester NY 14627, January 1987.

[Wesley82a]

L. P. Wesley and A. R. Hanson, The Use of an
Evidential-Based Model for Representing
Knowledge and Reasoning about Images in the
VISIONS System, PAMI 4,5 (Sept 1982), 14-25,
IEEE.

[Wesley82b]

L. P. Wesley and A. R. Hanson, The Use of an
Evidential-Based Model for Representing
Knowledge and Reasoning about Images in the
VISIONS System, Proceedings of the Workshop
on Computer Vision: Representation and Control,
August 1982, 14-25.




Multi-Modal  Segmentation  using  Markov  Random  Fields 

Paul  B.  Chou  and  Christopher  M.  Brown 
Computer  Science  Department 
The  University  of  Rochester 
Rochester,  New  York  14627 
ARPA:  chou@rochester.arpa  brown@rochester.arpa 
UUCP:  {allegra,seismo,decvax}!rochester!chou 
{allegra,seismo,decvax}!rochester!brown 


ABSTRACT 

This paper develops a new approach, based on Bayesian
probability theory, to combine information from various
sources for image segmentation. In this approach, the
observable evidence and prior knowledge are modeled
separately because of their distinct characteristics. A set of
early visual modules provides opinions about individual image
elements based on disparate sources of image observations.
Their opinions are combined coherently and consistently
through a hierarchically structured knowledge tree. The prior
knowledge about spatial interactions of the image features is
modeled by using Markov Random Fields. This approach
constantly maintains the a posteriori probabilities of
segmentations resulting from combining the prior knowledge
and the available opinions. The probabilistic justification for
this approach is provided. A set of experimental results using
synthetic input and a stochastic relaxation estimation
procedure demonstrates some of the advantages of this
approach.

0.  Background  and  Motivation 

An important aspect of computer vision is to infer a
representation of a three dimensional scene from two
dimensional projections yielding features like shadows, stereo
disparities, texture, and optical flow, all corrupted by noise.
The problem is ill-posed given only the input projections and
the characteristics of the imaging devices. Therefore,
computer vision research does not just study the geometry and
the photometry of the cameras, but also relies on scene
modeling (prior knowledge) and mechanisms for integrating
the scene models with the input observations to make 3D
inferences possible.

A major portion of computer vision research for the
past decade has been devoted to the so-called intrinsic image
computations [Barrow & Tenenbaum, 78] [Marr, 82]. The
emphasis of these efforts is on the relations between scene
parameters and individual image features (e.g.
shape-from-shading and shape-from-contours). Due to the
ill-posed nature of each problem, most of them assume that the
images have already been segmented into "homogeneous"
regions so that they can apply some global assumptions
(prior knowledge) to regularize their computations [Ikeuchi
and Horn, 81] [Grimson, 81] [Terzopoulos, 86]. The global
assumptions imposed by different computational modules
may be inconsistent, so it is not obvious how to combine
their results coherently.

We propose a new approach, based on Bayesian
probability theory, for integrating information from various
sources. It consistently and coherently combines the opinions
of the early modules and updates the probabilities of the
image features accordingly. Each of the early visual
modules acts as an expert on a set of hypotheses in a
hierarchical knowledge structure H that relates individual
image elements. The prior knowledge about the spatial
interactions of the image features is modeled as Markov
Random Fields. This approach is useful for the image
segmentation problem, and can be extended to solve other
signal estimation problems as well.

The Bayesian formalism is widely used as a tool for
updating "belief" as new information is acknowledged. Bolles
uses a set of probabilistic feature detectors to look for certain
objects in the image [Bolles, 77]. Witkin uses the Bayesian
estimation paradigm to estimate the orientation of a
textured plane [Witkin, 81]. [Feldman and Yakimovsky, 74]
[Geman and Geman, 84] [Marroquin, 85] [Elliott and Derin,
84] use it to solve the segmentation problem. In [Bolle and
Cooper, 84], a Bayesian classifier is developed to estimate
the underlying surface types of image regions. A class of
research known as Probabilistic Relaxation [Rosenfeld et al.,
76] [Shvayster and Peleg, 85] [Hummel and Zucker, 83] was
also inspired by the Bayesian formalism. It interprets
spatial knowledge as statistical quantities related to the
conditional probabilities, correlations, or compatibility among
neighboring elements, and develops updating rules to
incorporate contextual information.

What  is  lacking  in  the  previous  work,  however,  is  a 
uniform  fusion  mechanism  to  integrate  disparate  visual 
information  in  a  probabilistically  justifiable  manner. 

A visual module could provide opinions on any
mutually exclusive set of labels in the hierarchical knowledge tree
H. We shall develop an evidence aggregation method that
combines consistently and coherently the opinions of the
visual modules on a label tree. This method, based on the
reasoning proposed in [Pearl, 86], follows the Bayesian
formalism. It does not require any knowledge of prior
probabilities. It requires only trivial computations. With this
method, we are able to design individual experts for a subset
of labels of the tree without having to know about the rest of
the world. The combined opinion can thus be fused with the
image knowledge represented by the a priori probabilistic
distributions.

In  Section  1,  we  define  the  segmentation  problem  and 
describe  the  representations  for  prior  knowledge  and  the 
opinions  from  early  modules.  In  Section  2,  we  discuss  how  to 
encode the prior knowledge as Markov Random Fields. The
method  for  combining  opinions  of  the  early  modules  and  its 
probabilistic  justification  are  presented  in  Section  3.  Section 
4  shows  the  computations  for  the  a  posteriori  probabilities. 
Illustrative  experimental  results  are  shown  in  Section  5. 

1.  A  Probabilistic  View  of  the  Image  Segmentation 
Problem 

Represent an image as a set of primitive elements
$S = \{s_1, s_2, \ldots, s_N\}$. A segmentation $\omega$ of the image with
respect to a label set $L = \{l_1, l_2, \ldots, l_Q\}$ is a mapping from $S$ to $L$.
Let $\omega_s = \omega(s) \in L$ represent the label attached to $s$ in
segmentation $\omega$. Let $\Omega$ be the set of all segmentations. The
image segmentation problem with respect to $L$ can be loosely
described as finding the $\omega \in \Omega$ that "best fits" the
information collected, subject to the limitation of the computational
resources. This section describes the uses of probability as
the representation for various kinds of information and the
corresponding criteria for finding the "best fit".

It is frequently desirable to organize segmentation
labels as a hierarchical tree. Figure 1 shows one example of
such trees. Each internal node in a tree of labels represents
the disjunction of its sons. Each cross-section is a mutually
exclusive and exhaustive label set; i.e., a segmentation
problem can be defined with respect to a cross-section in a label
tree. Using such a tree, we can represent a particular piece
of knowledge about the labels at whatever level of
abstraction is appropriate. For the rest of the paper, we use L
to denote a mutually exclusive and exhaustive set of
labels in a label tree H.
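To make the notion of a cross-section concrete, the following minimal sketch enumerates every mutually exclusive and exhaustive label set of a small tree. The tree and its label names are hypothetical (they are not the tree of Figure 1), and the function name is an assumption made for illustration.

```python
from itertools import product

# Hypothetical label tree: each internal node maps to its sons.
TREE = {
    "scene": ["object", "background"],
    "object": ["metal", "wood"],
}


def cross_sections(node, tree):
    """Return every mutually exclusive, exhaustive label set below `node`."""
    sections = [[node]]  # the node alone stands for the disjunction of its sons
    sons = tree.get(node, [])
    if sons:
        per_son = [cross_sections(son, tree) for son in sons]
        # Pick one cross-section under each son and concatenate them.
        for choice in product(*per_son):
            sections.append([label for part in choice for label in part])
    return sections


ALL_SECTIONS = cross_sections("scene", TREE)
# [['scene'], ['object', 'background'], ['metal', 'wood', 'background']]
```

Each returned list is a candidate label set L with respect to which a segmentation problem could be posed.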


1.1.  Global  Prior  Knowledge 

Let $X = \{X_s, s \in S\}$ be a set of random variables indexed
by $S$, with $X_s \in L$ for all $s$. A segmentation can be considered
a realization, or a configuration, of this random field and $\Omega$
can be considered the configuration space of $X$. Ideally the
prior knowledge about $\Omega$ can be represented by a probability
distribution over $\Omega$. In practice this distribution is either
unobtainable or unmanageable due to the immense size ($Q^N$)
of the sample space. In many image understanding
applications, however, some restricted classes of distributions can
model the image adequately due to the local behavior of the
image phenomena. In this paper, we will exclusively use
Markov Random Fields (MRFs) as the a priori models for $\Omega$,
but the work illustrated here can be extended to other
image models as well. Section 2 will discuss the MRF model
in detail.
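A quick calculation shows why a distribution over the whole configuration space cannot be tabulated directly. The sketch below simply evaluates $Q^N$; the image size and label count are illustrative assumptions, not values from the experiments.

```python
def configuration_count(num_labels, num_elements):
    """Size of the configuration space: Q ** N labelings of N elements."""
    return num_labels ** num_elements


# Even a binary labeling of a small 64 x 64 image is far beyond enumeration.
COUNT = configuration_count(2, 64 * 64)
DIGITS = len(str(COUNT))  # number of decimal digits in Q ** N
```

For these assumed values the count has more than a thousand decimal digits, which is the practical motivation for restricting attention to locally specified models such as MRFs.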

1.2. Local Visual Observations

We model early visual computations as the
computations performed by a set of independent modules. The input
for these modules is noise-corrupted visual data, such as
image irradiance, texture, stereo disparities, etc. When
forming an opinion about image element $s$, a module may
restrict its consideration of input to some spatial region
dependent on $s$. Typically, the region will include $s$ and its
spatially adjacent elements. Also, a module might not use
all the input available for a given element; that is, it might
use only irradiance when other data are also available. The
subset of input used to form an opinion about element $s$ is the
observation $O_s$.

In our treatment, the opinions of the modules are
presented in terms of likelihood ratios. For example, module
A is an expert on the label set $L_A \subseteq L$. After observing
$O_s$, the module reports one likelihood ratio for each label
in $L_A$. A likelihood ratio is the probability of the observation
given that one label truly applies divided by the probability
of the observation should none of the labels in $L_A$ apply. For
example, the likelihood ratio reported for label $l$ is:

$$\lambda_l^A = \frac{P(O_s \mid l)}{P\left(O_s \mid \neg \bigcup_{l' \in L_A} l'\right)}$$

Note that we define $P\left(O_s \mid \neg \bigcup_{l' \in L_A} l'\right)$ to be 1 when $L_A$ is
exhaustive. The methods for designing such modules are
well known. Interested readers can consult [Bolles, 77] and
[Sher, 87].

For a purpose that will soon become clear, we impose
the following assumption of conditional independence
between spatially distinct observations:

$$P(O^A \mid \omega) = \prod_{s \in S} P(O_s^A \mid \omega_s) \qquad (1.1)$$

where the superscript A indicates the observations of the
module A. This assumption has been used implicitly in
numerous applications and is valid whenever the noise
processes are spatially independent [Derin and Cole,
86] [Marroquin, 85].




1.3. Posterior Probability and Bayesian Estimation

Following the Bayesian formalism, the goodness of a
segmentation can be evaluated in terms of its a posteriori
expected loss,

$$E(\mathrm{Loss}(\omega \mid O)) = \sum_{f \in \Omega} \mathrm{loss}(\omega, f) \, P(f \mid O) \qquad (1.2)$$

where $P(f \mid O)$ denotes the posterior probability of $f$ given
the observation $O$. Bayes' rule can then be used to derive
the a posteriori probability

$$P(\omega \mid O) = \frac{P(\omega) \, P(O \mid \omega)}{\sum_{f \in \Omega} P(f) \, P(O \mid f)} \qquad (1.3)$$


From (1.1), observe that scaling all $P(O_s \mid l)$ by a constant
factor for fixed $s$ does not change the posterior distribution in
(1.3). This fact allows us to combine the likelihoods without
having to normalize the results.
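The scale invariance noted above can be checked numerically on a toy problem. The sketch below enumerates every labeling of a two-element image, applies the product form (1.1) and Bayes' rule (1.3), and compares the posterior before and after multiplying one element's likelihoods by a constant. The prior weights and likelihood values are invented for illustration, and brute-force enumeration is used only because the example is tiny.

```python
from itertools import product


def posterior(prior, site_likelihoods, labels):
    """Brute-force P(w | O) over all labelings, following (1.1) and (1.3).

    prior: function mapping a labeling tuple to an unnormalized prior weight.
    site_likelihoods: one dict per element s giving P(O_s | label).
    """
    weights = {}
    for w in product(labels, repeat=len(site_likelihoods)):
        like = 1.0
        for s, label in enumerate(w):
            like *= site_likelihoods[s][label]  # conditional independence (1.1)
        weights[w] = prior(w) * like
    total = sum(weights.values())  # denominator of (1.3)
    return {w: v / total for w, v in weights.items()}


LABELS = ("a", "b")


def smooth_prior(w):
    """Hypothetical prior that favors equal labels on the two elements."""
    return 2.0 if w[0] == w[1] else 1.0


LIKES = [{"a": 0.7, "b": 0.2}, {"a": 0.4, "b": 0.5}]
# Scale every likelihood of element 0 by the same constant.
SCALED = [{k: 10.0 * v for k, v in LIKES[0].items()}, LIKES[1]]

P1 = posterior(smooth_prior, LIKES, LABELS)
P2 = posterior(smooth_prior, SCALED, LABELS)
```

The constant appears in every term of both numerator and denominator of (1.3), so `P1` and `P2` agree up to rounding.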

The choice for the loss function depends on the
characteristics of a particular application. Different estimation
algorithms can be designed according to the different loss
functions. In [Geman and Geman, 84], the Maximum A
Posteriori (MAP) estimation is used. A simulated annealing
procedure with a stochastic sampler (Gibbs sampler) carries out
the computation. In [Marroquin, 85], the Maximizer of the
Posterior Marginals (MPM) estimation is proposed for the
segmentation problem. A Monte Carlo procedure that uses
the Gibbs sampler collects sample statistics and makes the
estimation. Figure 2 is a block diagram for a segmentation
system following the methodology proposed in this paper.
This approach, computing the a posteriori probabilities for
the segmentations given the set of opinions from the early
modules and the a priori probability distribution of the
image, can support both the MAP and MPM estimation
methods as well as other Bayesian estimations.
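The difference between the two estimators can be seen on a posterior small enough to tabulate. In the sketch below the posterior over two elements is a made-up table; MAP picks the single most probable labeling, while MPM picks, for each element separately, the label with the largest posterior marginal. The cited work uses annealing and Monte Carlo sampling instead of the exhaustive table assumed here.

```python
def map_estimate(post):
    """Labeling with the highest posterior probability."""
    return max(post, key=post.get)


def mpm_estimate(post, labels):
    """Per-element maximizer of the posterior marginals."""
    n = len(next(iter(post)))
    estimate = []
    for s in range(n):
        marginal = {label: 0.0 for label in labels}
        for w, p in post.items():
            marginal[w[s]] += p  # sum out every other element
        estimate.append(max(marginal, key=marginal.get))
    return tuple(estimate)


# Hypothetical posterior over labelings of two elements.
POST = {("a", "a"): 0.4, ("b", "b"): 0.3, ("b", "a"): 0.3, ("a", "b"): 0.0}

MAP_RESULT = map_estimate(POST)              # ('a', 'a')
MPM_RESULT = mpm_estimate(POST, ("a", "b"))  # ('b', 'a')
```

Here the two estimates disagree: the first element is labeled "b" with marginal probability 0.6 even though the most probable joint labeling assigns it "a".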


2.  Markov  Random  Fields 

Markov Random Fields have been used for image
modeling in many applications for the past few years [Hassner
and Sklansky, 80] [Geman and Geman, 84] [Marroquin,
85] [Cross and Jain, 83] [Derin and Cole, 86]. One of the
most successful applications of MRFs is to model the spatial
interactions of image features. In this section, we review the
properties of MRFs and describe how to encode prior
knowledge in this formalism. We refer the reader to [Kinderman
and Snell, 81] for an extensive treatment of MRFs.


2.1.  Definition 

Let $X = \{X_s, s \in S\}$ be a set of random variables indexed
by $S$ and $E$ a set of unordered 2-tuples $(s_i, s_j)$ representing
the connections between the elements in $S$. The set $E$
defines a neighborhood system $N = \{N_s \mid s \in S\}$, where $N_s$ is
the neighborhood of $s$ in the sense that

(1) $s \notin N_s$, and

(2) $r \in N_s$ if and only if $(s, r) \in E$.

Let $\omega = \{X_s = \omega_s, s \in S\}$ be a configuration of $X$, $\omega_s \in L$, and $\Omega$
the set of all possible configurations. We say $X$ is a Markov
Random Field with respect to $N$ if and only if

$$P(X = \omega) > 0 \quad \text{for all } \omega \in \Omega \qquad (2.1)$$

$$P(X_s = \omega_s \mid X_r = \omega_r, r \in S, r \neq s) = P(X_s = \omega_s \mid X_r = \omega_r, r \in N_s) \qquad (2.2)$$

The conditional probabilities in the right-hand side of
(2.2) are called the local characteristics that characterize the
random field. An intuitive interpretation of (2.2) is that the
contextual information provided by $S - \{s\}$ to $s$ is the same as
the information provided by the neighbors of $s$. Thus the
effects of members of the field upon each other are limited to
local interaction as defined by the neighborhood. A very
desirable property of MRFs that makes them attractive to
scientists in many disciplines is the MRF-Gibbs equivalence
described in the following theorem.
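As a concrete instance of the definition, the sketch below builds the edge set $E$ of a four-connected lattice and derives the neighborhood system $N$ from it. The function names and the 3 x 3 lattice are assumptions chosen for illustration.

```python
def four_connected_edges(rows, cols):
    """Unordered pairs (s, r) linking vertically and horizontally adjacent sites."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                edges.add(frozenset({(r, c), (r + 1, c)}))
            if c + 1 < cols:
                edges.add(frozenset({(r, c), (r, c + 1)}))
    return edges


def neighborhood_system(sites, edges):
    """N_s for every site s: r is in N_s exactly when (s, r) is in E."""
    neighbors = {s: set() for s in sites}
    for edge in edges:
        s, r = tuple(edge)
        neighbors[s].add(r)
        neighbors[r].add(s)
    return neighbors


SITES = [(r, c) for r in range(3) for c in range(3)]
N = neighborhood_system(SITES, four_connected_edges(3, 3))
```

Because the edges are unordered pairs of distinct sites, condition (1) holds automatically and the neighbor relation is symmetric.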


2.2. MRF-Gibbs Equivalence

Hammersley-Clifford Theorem: A random field $X$ is an
MRF with respect to the neighborhood system $N$ if and only
if

$$P(\omega) = \frac{e^{-U(\omega)/T}}{Z} \quad \text{for all } \omega \in \Omega \qquad (2.3)$$

where

$$U(\omega) = \sum_{c \in C} V_c(\omega) \qquad (2.4)$$

$C$ is the set of totally connected subgraphs (cliques) with
respect to $N$. $Z$ is a normalizing constant, so that the
probabilities of all realizations sum to one.

Several terms from physics can provide intuition
about the Gibbs measure, the right-hand side of (2.3). $T$
is the temperature of the field that controls the flatness of
the distribution of the configurations. A potential $V$ is a way
to assign a number $V_c(\omega)$ to every subconfiguration $\omega_c$ of a
configuration $\omega$, where $c \in C$. $U(\omega)$, the sum of the local
potentials, is the energy of the configuration $\omega$. A system is
in thermal equilibrium when the probabilities of its
configurations follow the Gibbs measure.
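The role of the temperature can be seen by tabulating the Gibbs measure for a very small field. The sketch below assumes a chain of three binary elements with pairwise cliques and a hypothetical potential that charges 1 for each pair of unequal neighbors; none of these choices come from the paper's experiments.

```python
import math
from itertools import product


def gibbs_distribution(labels, n_sites, cliques, potential, temperature):
    """Tabulate P(w) = exp(-U(w) / T) / Z by exhaustive enumeration."""
    energies = {}
    for w in product(labels, repeat=n_sites):
        energies[w] = sum(potential(c, w) for c in cliques)  # U(w), as in (2.4)
    z = sum(math.exp(-u / temperature) for u in energies.values())
    return {w: math.exp(-u / temperature) / z for w, u in energies.items()}


def disagreement(clique, w):
    """Hypothetical pair potential: 1 if the two labels differ, else 0."""
    i, j = clique
    return 0.0 if w[i] == w[j] else 1.0


CHAIN_CLIQUES = [(0, 1), (1, 2)]
COLD = gibbs_distribution(("a", "b"), 3, CHAIN_CLIQUES, disagreement, 0.5)
HOT = gibbs_distribution(("a", "b"), 3, CHAIN_CLIQUES, disagreement, 5.0)
```

At the lower temperature the probability mass concentrates on the two uniform labelings; at the higher temperature the distribution is flatter, although low-energy configurations remain the most probable.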

With the Hammersley-Clifford Theorem, an MRF can be
characterized by the potential function $V$ instead of the local
characteristics. The joint probabilities of the random
variables in $X$ can be computed by summing up the local
potential assignments. The local characteristics can be computed
from the potential function through the following relation:

$$P(X_s = \omega_s \mid X_r = \omega_r, r \neq s) = \frac{e^{-\frac{1}{T} \sum_{c \in C_s} V_c(\omega)}}{\sum_{\omega'} e^{-\frac{1}{T} \sum_{c \in C_s} V_c(\omega')}} \qquad (2.5)$$

where $C_s$ is the set of cliques that contain $s$, and $\omega'$ is any
configuration of the field that agrees with $\omega$ everywhere
except possibly $s$. [Geman and Geman, 84] proves that by
using the Gibbs sampler, that is, randomly and repeatedly selecting
configurations for each variable in $X$ according to (2.5), the
field will reach thermal equilibrium eventually.
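The local computation behind the Gibbs sampler can be sketched as follows. Only the cliques containing $s$ are evaluated, since all other terms of the energy cancel in the conditional probability. The three-element chain, the pair potential, and the fixed left-to-right visiting order are assumptions made for illustration; the cited procedure selects elements randomly.

```python
import math
import random


def local_characteristic(w, s, labels, cliques_of, potential, temperature):
    """P(X_s = l | all other elements) for each label l, using cliques containing s."""
    scores = []
    for label in labels:
        trial = list(w)
        trial[s] = label  # a configuration agreeing with w except possibly at s
        energy = sum(potential(c, trial) for c in cliques_of[s])
        scores.append(math.exp(-energy / temperature))
    total = sum(scores)
    return [x / total for x in scores]


def gibbs_sweep(w, labels, cliques_of, potential, temperature, rng):
    """Resample every element once from its local characteristic."""
    w = list(w)
    for s in range(len(w)):
        probs = local_characteristic(w, s, labels, cliques_of, potential, temperature)
        w[s] = rng.choices(labels, weights=probs)[0]
    return w


def disagreement(clique, w):
    """Hypothetical pair potential: 1 if the two labels differ, else 0."""
    i, j = clique
    return 0.0 if w[i] == w[j] else 1.0


LABELS = ("a", "b")
CLIQUES_OF = {0: [(0, 1)], 1: [(0, 1), (1, 2)], 2: [(1, 2)]}

MIDDLE = local_characteristic(["a", "b", "a"], 1, LABELS, CLIQUES_OF, disagreement, 1.0)
SAMPLE = gibbs_sweep(["a", "b", "a"], LABELS, CLIQUES_OF, disagreement, 1.0, random.Random(0))
```

With both neighbors labeled "a", the middle element takes label "a" with probability $1 / (1 + e^{-2})$, which is the value stored in `MIDDLE[0]`.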


2.3.  Encoding  Prior  Knowledge 

For the image segmentation problem, the goal is to
choose an appropriate neighborhood system and a potential
function for the random field X over the image S to represent
our prior knowledge about the image. The neighborhoods
should be large enough to capture the interactions between
the primitive elements but still small enough for a machine
to carry out the computations required to make an
estimation. For a lattice-structured image, four- and
eight-connected neighborhood systems are the most commonly
used. Once a neighborhood system has been decided, the
potential function can be estimated, at least in principle,
given a set of realizations of the uncorrupted image. Some
pioneering work has studied several statistical estimation
methods for homogeneous MRFs over lattices [Cross and
Jain, 83] [Elliott and Derin, 84]. We feel that a general
theory for choosing the neighborhood system and the
potential function still lies far ahead. For now, we choose the
parameters for the potential function to represent our
general knowledge about the world in an ad hoc way based on
the following observation: the higher the energy measure of
a configuration, the less likely it is to occur. We assign high
potential values to those "improbable" subconfigurations over
cliques and low potentials to those subconfigurations that we
expect to see frequently. Our experimental work is showing
satisfactory results (Section 5).


3.  Combining  Opinions  of  Early  Visual  Modules 

Most research on evidence combination has focused on
updating the "belief" in a given hypothesis about an
individual element when a piece of new evidence becomes
available [Pearl, 86] [Shafer, 76] [Reynolds et al., 86]. This
approach, however, is not suitable for our purpose. First of
all, in our model the prior knowledge about the image
elements is global and statistical: it is their joint probability
distribution. In contrast, a piece of evidence from local
observation bears directly upon an individual element. It is
not computationally feasible to maintain "marginal belief"
for individual image elements by summing the joint
probabilities whenever a new piece of evidence becomes available.
Moreover, a piece of new evidence about a particular element
influences the belief distributions of the rest of the elements. We
believe that an information fusion mechanism should
constantly maintain a representation of knowledge to reflect the
total information available, except possibly for transient periods
of time for aggregating evidence locally. This requirement is
also desirable, if not necessary, for implementing vision
systems in distributed environments. Maintaining "marginal
belief" requires the effects of updating local "belief" to be
spatially propagated, thus violating such a requirement.

In  this  section,  we  limit  our  attention  to  an  individual 
element  s  of  S.  We  show  how  the  opinions  about  s  can  be 
combined  and  provide  the  probabilistic  justification  for  the 
proposed  method.  In  Section  4  we  show  how  the  updating  of 
this  joint  probability  distribution  given  a  new  set  of  opinions 
about  a  set  of  primitive  elements  can  be  carried  out  with 
simple  operations. 

3.1.  Representations  and  Combination  Rules 

As in Pearl's construction [Pearl, 86], we assume the
segmentation labels can be organized as a hierarchical tree
H (e.g. Figure 1). Node $l$ denotes the hypothesis that the
corresponding primitive element is of label $l$, i.e., $X_s = l$.
Each internal node stands for the disjunction of its sons.
Unlike other approaches, the numbers maintained in our
method do not indicate the states of belief of the hypotheses
but rather the degrees of hypothesis confirmation or
disconfirmation provided by the collected evidence. In
particular we do not require an initial "belief" (prior probability)
for each node as required in Pearl's scheme.

Let $\alpha_l$ denote the current degree of
confirmation/disconfirmation for node l. The probabilistic
interpretations for the $\alpha$'s will be given in Section 3.2.
Initially, $\alpha_l$ is set to unity for every l, indicating "neither
confirmed nor disconfirmed". Besides $\alpha_l$, each internal node l
keeps one value, $w_i^l$, for each son i. Initially, $w_i^l$ is set to the
a priori probability of i given l. Obviously, the $w_i^l$'s of each
node sum up to unity initially.

Suppose a module A reports its opinion as a set of likelihood
ratios $\{\lambda_l^A \mid l \in L_A\}$, where $L_A$ is a set of mutually
exclusive labels contained in H as described in Section 1.2.
The corresponding $\alpha$'s are updated according to the following
rule:

$$\alpha_l \leftarrow \lambda_l^A\,\alpha_l \quad \text{for each } l \in L_A \qquad (3.1)$$

To  maintain  the  coherence  of  the  a’s,  the  effect  of  this  opin¬ 
ion  has  to  be  propagated  throughout  the  label  tree.  The  pro¬ 
pagation  procedure  can  be  described  in  terms  of  message 
passing: 

(1) Every node $l \in L_A$ sends a message $m = \lambda_l^A$ to its
father and each of its sons.

(2) Any node k that receives a message m from its
father passes m to all its sons and replaces $\alpha_k$ by
$m\alpha_k$, that is,

$$\alpha_k \leftarrow m\,\alpha_k \qquad (3.2)$$

(3) Any node j that receives a message m from one of its
sons (say i) updates $w_i^j$ by $m w_i^j$, i.e.,

$$w_i^j \leftarrow m\,w_i^j \qquad (3.3)$$

and sends a message $m'$ to its father, where

$$m' = \sum_i w_i^j, \qquad (3.4)$$

then updates $\alpha_j$ and all the $w_k^j$'s according to

$$\alpha_j \leftarrow m'\,\alpha_j \qquad (3.5a)$$

$$w_k^j \leftarrow \frac{w_k^j}{m'} \quad \text{for all } k \qquad (3.5b)$$

where the summation in (3.4) is taken over all the
sons of j.

It  is  easy  to  see  that  the  combination  and  propagation 
procedures  are  commutative  and  associative,  so  their  order  is 
irrelevant. 
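
The combination and propagation rules (3.1)-(3.5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `LabelTree`, the dictionary layout, the three-level example tree and all numbers are hypothetical, and an opinion is applied one label at a time, which the commutativity noted above permits.

```python
class LabelTree:
    """Degrees of confirmation (alpha) and son weights (w) on a label tree."""

    def __init__(self, sons, cond_prior):
        # sons: father -> list of sons; cond_prior: (father, son) -> P(son | father)
        self.sons = sons
        self.father = {s: f for f, ss in sons.items() for s in ss}
        self.alpha = {n: 1.0 for n in set(sons) | set(self.father)}
        self.w = dict(cond_prior)

    def _down(self, node, m):
        # Rule (3.2): a message from the father scales alpha and is passed on.
        for son in self.sons.get(node, []):
            self.alpha[son] *= m
            self._down(son, m)

    def _up(self, son, m):
        # Rules (3.3)-(3.5): reweight the son, renormalise, notify the father.
        father = self.father.get(son)
        if father is None:
            return
        self.w[(father, son)] *= m                                   # (3.3)
        m_up = sum(self.w[(father, k)] for k in self.sons[father])   # (3.4)
        self.alpha[father] *= m_up                                   # (3.5a)
        for k in self.sons[father]:
            self.w[(father, k)] /= m_up                              # (3.5b)
        self._up(father, m_up)

    def report(self, opinion):
        # opinion: label -> likelihood ratio, for mutually exclusive labels.
        for label, ratio in opinion.items():
            self.alpha[label] *= ratio                               # (3.1)
            self._down(label, ratio)
            self._up(label, ratio)


def example_tree():
    # Hypothetical hierarchy and conditional priors, for illustration only.
    sons = {"any": ["edge", "no_edge"], "edge": ["depth", "photometric"]}
    prior = {("any", "edge"): 0.4, ("any", "no_edge"): 0.6,
             ("edge", "depth"): 0.75, ("edge", "photometric"): 0.25}
    return LabelTree(sons, prior)


tree = example_tree()
tree.report({"depth": 4.0})   # one module supports "depth"
tree.report({"edge": 2.0})    # another module supports "edge"
```

Reporting the two opinions in the opposite order should leave every value in `tree.alpha` unchanged, which is a quick way to check the order-independence claim on a small tree.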

3.2.  Probabilistic  Justification 

The above method fits in the Bayesian formalism if we
maintain two notions of conditional independence. First, evidence
$O_A$ that bears directly on a label l says nothing about
the descendants of l:

$$P(O_A \mid l, l_i) = P(O_A \mid l), \quad l_i \text{ a descendant of } l, \qquad (3.6)$$

$$P(O_A \mid \neg l, l_j) = P(O_A \mid \neg l), \quad l_j \text{ a descendant of } \neg l.$$

Second, the observations of different modules are conditionally
independent:

$$\frac{P(O \mid l)}{P(O \mid \neg l)} = \prod_A \frac{P(O_A \mid l)}{P(O_A \mid \neg l)} \qquad (3.7)$$

where the product on the right-hand side is over a set of
modules and O is the union of their observations $O_A$.

As suggested by Pearl, (3.6) states that when the observation
$O_A$ is a unique property of l, common to all its descendants,
once we know l is true/false, the identity of $l_i$ or $l_j$
does not make $O_A$ more or less likely. (3.7), implicitly used
in Pearl’s scheme, states that each piece of evidence observed
by the early modules provides independent information
about a label. We believe that the disparate types of image
clues in vision applications satisfy this assumption.

We define consistent states of the $\alpha$'s as the states in which,
for each available opinion, either all of the $\alpha$'s have been updated
according to rules (3.1)-(3.5), or none of the $\alpha$'s have been
changed with respect to this opinion. We say that a set of
opinions derives a consistent state if all opinions in this set,
and no other opinions, have been used to update the $\alpha$'s.
The following theorem relates the $\alpha$'s to the likelihood
probabilities at consistent states.

Theorem 1: Let $\alpha_l^t$ denote the $\alpha$ value for l at the consistent
state t, and $P(O_t \mid l)$ be the probability of $O_t$ given the
label l, where $O_t$ denotes the union of those observations
that form the set of opinions that derives the state t. If
$O_t \neq \emptyset$, then

$$\alpha_l^t = c_t\,P(O_t \mid l) \quad \text{for all } l \in H \qquad (3.8)$$

where $c_t$ is a constant depending only on t, given (3.6) and
(3.7).

Due to the limitation on the size of this paper, we provide
here several keys for proving Theorem 1 instead of a
complete proof. The effect of applying rule (3.1) is to multiply
each label l in $L_A$ by the likelihood $P(O_A \mid l)$ and labels in
$\neg L_A$ by $P(O_A \mid \neg(\bigcup_{l \in L_A} l))$. Downward propagation, rule (3.2),
follows directly from the conditional independence assumptions
(3.6). Upward propagation, rules (3.3)-(3.5), follows
from the fact:

$$P(O \mid l) = \sum_i P(O \mid l_i)\,P(l_i \mid l)$$

where l is the disjunction of the $l_i$'s.

Applying Theorem 1 and Bayes’ rule, we have:

Corollary 1: Let $\alpha_l^t$ denote the $\alpha$ value for l at the consistent
state t. If $P_0$ is the prior p.d.f. of a set of mutually
exclusive and exhaustive labels L, then the posterior probability
$P_t(l)$ of $l \in L$ at the consistent state t is

$$P_t(l) = \frac{\alpha_l^t\,P_0(l)}{\sum_{l' \in L} \alpha_{l'}^t\,P_0(l')}.$$


To  summarize:  We  have  developed  an  evidence  combi¬ 
nation  method  for  a  hierarchy  of  hypotheses  based  on  the 
notions  of  conditional  independence  given  by  (3.6)  and  (3.7). 
This  scheme,  besides  having  all  the  characteristics  listed  in 
[Pearl,  86],  has  the  following  advantages: 

(1)  The  computations  involved  are  extremely  simple. 
Simpler  and  fewer  messages  must  be  passed.  Normal¬ 
izations  are  never  needed  since  relative  degrees  of 
confirmation/disconfirmation  are  maintained  instead 
of  probabilities  (Theorem  1). 

(2)  This  scheme  decouples  the  notion  of  evidence  and  a 
priori  belief.  That  is,  the  evidence  can  be  collected 
and  combined  in  the  absence  of  a  priori  belief  in  a 
probabilistic  formalism.  In  the  next  section  we  show 
this  characteristic  is  very  helpful  when  the  prior 
knowledge  is  represented  as  an  MRF. 
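
Corollary 1 and advantage (2) can be illustrated with a short Python sketch: the same degrees of confirmation are combined with two different priors only at the moment a posterior is requested. The function name `posterior`, the label names and the numbers are assumptions made for this example.

```python
def posterior(alpha, prior):
    """Corollary 1: P_t(l) is alpha_l * P_0(l), normalised over the label set."""
    unnormalised = {label: alpha[label] * prior[label] for label in prior}
    total = sum(unnormalised.values())
    return {label: value / total for label, value in unnormalised.items()}


# Hypothetical degrees of confirmation for a mutually exclusive, exhaustive set.
alpha = {"edge": 6.5, "no_edge": 1.0}

# The evidence is collected once; the prior is supplied afterwards.
flat = posterior(alpha, {"edge": 0.5, "no_edge": 0.5})
rare = posterior(alpha, {"edge": 0.1, "no_edge": 0.9})
```

Because the evidence is stored separately from the prior, swapping the prior requires no re-propagation through the label tree.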

4.  Combining  Prior  Knowledge  with  Observations 

In this section, we move our attention to the relationships
of segments of the image S. Recall that in the last section,
each primitive element is associated with a set of $\alpha$'s to
maintain the opinions of the early visual modules. Let $\beta_s$
denote the set of $\alpha$'s associated with $s \in S$, and $\beta_s(l)$ be the $\alpha$
value for label l in $\beta_s$. Define a global consistent state to be
a state of the $\beta$'s at which each $\beta_s$ is in a consistent state.




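Assume that the prior knowledge about the image is
represented as an MRF X over S, with $X_s \in L$, a mutually
exclusive and exhaustive label set in H, with respect to a
neighborhood system N. The Gibbs measure that characterizes
the prior MRF is:

$$P_0(\omega) = \frac{e^{-U_0(\omega)/T}}{Z_0} \qquad (4.1)$$

where

$$U_0(\omega) = \sum_{c \in C} V_c(\omega).$$

Given (1.1), Theorem 1, and Bayes' rule, the a posteriori
Gibbs measure of a configuration $\omega$ at a global consistent
state t can be computed as:

$$P_t(\omega) = \frac{e^{-U_t(\omega)/T}}{Z_t} \qquad (4.2)$$

where the a posteriori energy is

$$U_t(\omega) = \sum_{c \in C} V_c(\omega) - T \sum_{s \in S} \ln \beta_s(\omega_s). \qquad (4.3)$$

It is easy to see that the a posteriori Gibbs measure
characterizes an MRF over S with respect to the neighborhood
system N with the local characteristics:

$$P_t(X_s = \omega_s \mid X_r = \omega_r,\ r \in N_s) = \frac{\exp\left\{-\frac{1}{T}\sum_{c \in C_s} V_c(\omega) + \ln \beta_s(\omega_s)\right\}}{\sum_{\omega'} \exp\left\{-\frac{1}{T}\sum_{c \in C_s} V_c(\omega') + \ln \beta_s(\omega'_s)\right\}} \qquad (4.4)$$

where $C_s$ is the set of cliques containing s and the sum in the
denominator runs over the configurations $\omega'$ that agree with
$\omega$ everywhere except possibly at s.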


(4.3)  and  (4.4)  indicate  that  only  simple  local  operations 
are  needed  to  update  the  energy  measure  and  local  charac¬ 
teristics  as  new  opinions  from  the  early  visual  modules 
become  available.  Therefore,  estimation  methods  depending 
only  upon  these  measures,  such  as  MAP  and  MPM  estima¬ 
tions,  can  easily  be  implemented  in  the  proposed  framework 
(Section  5).  We  believe  that  based  on  this  property,  novel 
estimation  algorithms  can  ultimately  be  designed  that  incre¬ 
mentally  improve  their  estimations  as  more  and  more  infor¬ 
mation  arrives.  For  now,  the  existing  Bayesian  estimation 
methods  can  be  invoked  at  any  global  consistent  state  to  pro¬ 
vide  the  up-to-date  estimations. 
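
The locality of the update can be made concrete with a small Python sketch of the a posteriori energy (4.3). The three-site chain, the pair potential `disagreement`, the label names and the likelihood ratio of 4 are all hypothetical choices for illustration; only the form of the energy is taken from the text.

```python
import math


def posterior_energy(omega, cliques, potential, beta, T=1.0):
    """U_t(omega) of (4.3): clique potentials minus T times the log evidence."""
    prior_term = sum(potential(tuple(omega[s] for s in c)) for c in cliques)
    evidence_term = sum(math.log(beta[s][omega[s]]) for s in range(len(omega)))
    return prior_term - T * evidence_term


def disagreement(labels):
    # Hypothetical pair potential: penalise neighbours with different labels.
    return 0.0 if labels[0] == labels[1] else 1.0


cliques = [(0, 1), (1, 2)]                         # a three-site chain
beta = [{"E": 1.0, "NE": 1.0} for _ in range(3)]   # no evidence collected yet
omega = ["E", "E", "NE"]

before = posterior_energy(omega, cliques, disagreement, beta)
beta[1]["E"] *= 4.0                                # rule (3.1) applied at site 1 only
after = posterior_energy(omega, cliques, disagreement, beta)
```

A new opinion about site 1 touches a single entry of `beta`; the energy of any configuration that labels site 1 as "E" drops by T ln 4, and every other configuration keeps its energy.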


5.  Experimental  Results 

We  demonstrate  the  method  using  two  images  of  over¬ 
lapping  rectangular  patches.  Each  patch  in  the  first  image 
corresponds  to  a  geometrically  identical  patch  in  the  second. 
The  intensities  of  the  patches  in  each  image  are  randomly 
selected  from  the  range  [0,  265],  with  no  intensity  correla¬ 
tion  between  images.  A  random  number  is  generated  for 
each  pixel  and  is  added  to  its  intensity  value  to  simulate  a 
random  noise  process.  The  distributions  used  for  the  noise  in 
the  two  images  are  Gaussian  of  zero  mean,  with  standard 
deviation  16  and  12  respectively.  These  two  images  can  be 
considered  as  two  different  sources  of  information  about  the 


same set of rectangular objects. Figures 3.a and 4.a are
halftone  displays  of  these  two  images. 

A set of likelihood edge detectors, an early version of
the detectors described in [Sher, 86], provides a set of likelihood
ratios $\{ P(O \mid E_i) / P(O \mid NE),\ i = 1, \ldots, 4 \}$ for each pixel (given the
3×3 window of intensities centered at it), where NE denotes
the hypothesis that the given pixel is not an edge element,
and $E_i$ denotes the hypothesis that the given pixel is an edge
of one of the four (horizontal, vertical, and two diagonal)
orientations. The likelihoods are computed using a model for
step edges and a Gaussian model for additive noise. Figures
3.b and 4.b show the Maximum Likelihood Estimation
(MLE) for edges in Figures 3.a and 4.a respectively. That is, a
pixel s is on if and only if $\max_i \beta_s(E_i) > \beta_s(NE)$.

We  use  a  homogeneous  and  isotropic  MRF  with  a  third 
order  neighborhood  system  over  the  image  lattice  to  encode  a 
body  of  basic  knowledge  about  edges.  Cliques  of  size  3  are 
used  to  discourage  parallel  and  competing  edges,  whereas 
cliques  of  size  2  are  used  to  encourage  line  continuations, 
region  homogeneity  and  to  discourage  breaks  in  the  line 
forming  process.  The  potential  assignments  for  the  cliques 
are  chosen  conservatively  in  the  sense  that  estimation 
methods  based  on  (4.2)  and  (4.3)  make  as  few  false  detections 
of  edges  as  possible  while  maintaining  reasonable  detectabil¬ 
ity. 




We have implemented a program that does MPM estimation
based on the Monte Carlo procedure proposed in
[Marroquin, 85]. This program uses the Gibbs sampler that
randomly and repeatedly selects configurations for each pixel
based on the distribution given in (4.4). Due to the fast convergence
observed in the experiments, we start collecting
statistics at the 300th iteration. Figures 3.c and 4.c show the
MPM estimations based on the statistics collected over 300
iterations.
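
A toy version of this estimation loop, assuming a one-dimensional chain with two labels rather than the image lattice used in the experiments, might look as follows in Python. The raster-order sweeps, the coupling strength, the burn-in length and the evidence values are assumptions for the sketch, not the settings of the program described above.

```python
import math
import random


def local_characteristic(omega, s, beta, coupling, T):
    """P_t(X_s = label | neighbours) on a chain with pair cliques, as in (4.4)."""
    weights = []
    for label in (0, 1):
        energy = 0.0
        for r in (s - 1, s + 1):                  # the cliques that contain s
            if 0 <= r < len(omega):
                energy += 0.0 if omega[r] == label else coupling
        weights.append(math.exp(-energy / T + math.log(beta[s][label])))
    total = sum(weights)
    return [w / total for w in weights]


def mpm_estimate(beta, coupling=1.5, T=1.0, burn_in=100, sweeps=400, seed=0):
    """Gibbs sampling followed by a per-site majority vote (MPM estimate)."""
    rng = random.Random(seed)
    n = len(beta)
    omega = [0 if b[0] > b[1] else 1 for b in beta]   # start from the MLE labels
    counts = [[0, 0] for _ in range(n)]
    for sweep in range(burn_in + sweeps):
        for s in range(n):
            p_one = local_characteristic(omega, s, beta, coupling, T)[1]
            omega[s] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:                          # collect statistics
            for s in range(n):
                counts[s][omega[s]] += 1
    return [0 if c[0] > c[1] else 1 for c in counts]


# beta[s][label]: hypothetical degrees of confirmation; site 2 weakly favours 0.
beta = [[1.0, 20.0], [1.0, 20.0], [1.0, 0.8], [1.0, 20.0], [1.0, 20.0]]
mle = [0 if b[0] > b[1] else 1 for b in beta]
mpm = mpm_estimate(beta)
```

In this example the MLE labels site 2 differently from its neighbours, while the MPM estimate, which also accounts for the smoothness prior, is expected to fill the gap.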

We can consider that Figures 3.a and 4.a provide only
partial evidence to support the NE and $E_i$ hypotheses. For
example, Figure 3.a would represent a noise-corrupted depth
map, Figure 4.a would be an image of irradiance, and $E_i$
would correspond to the hypotheses that the pixel is at a
depth discontinuity or at an edge purely caused by photometrical
effects. Under this interpretation, Figures 5.a and
5.b show the MLE and the MPM estimations resulting from
applying the upward propagation rules (3.3)-(3.5) with the
initial $w_D = 0.75$, where D stands for depth edge, to combine
the partial evidence of Figure 3.a with the information
provided by Figure 4.a.

Alternatively, we can consider that Figures 3.a and 4.a
independently support the same set of hypotheses. By applying
rule (3.1), we obtain Figures 6.a and 6.b, representing the
MLE and the MPM estimations based on the combined information.
Observe the lines detected in the lower left quadrant
of Figure 6.b that do not show up in either of Figures
3.c and 4.c, and the false detections in Figures 3.c and 4.c
that are removed in Figure 6.b. These sorts of results cannot
be achieved by multi-modal segmenters that rely on
Boolean operations to combine evidence.

6.  Conclusions  and  Future  Research 

We  have  presented  a  new  approach,  based  on  Bayesian 
probability  theory,  for  integrating  information  from  various 
sources.  This  approach  decouples  the  notion  of  observable 
evidence  and  prior  knowledge  so  that  each  type  of  informa¬ 
tion  can  be  treated  according  to  its  own  characteristics.  We 
have  shown  how  to  combine  coherently  and  consistently 
different  pieces  of  evidence  for  a  set  of  hypotheses  organized 
as  a  hierarchical  knowledge  tree  and  how  to  incorporate 
prior  knowledge  modeled  by  MRFs.  We  have  explicitly  listed 
the  set  of  independence  assumptions  we  need  and  have  pro¬ 
vided  the  probabilistic  justification  for  this  approach.  Exper¬ 
iments  on  synthetic  data  have  shown  promising  results. 
There  is  still  much  to  do,  however.  The  MRF  model  we  used 
is  very  rudimentary.  Much  more  prior  knowledge  can  be 
encoded  using  the  same  concept.  We  are  currently  improv¬ 
ing  the  MRF  model  to  handle  curved  lines.  One  of  our  ulti¬ 
mate  goals  is  to  encode  all  kinds  of  geometrical  and  pho- 
tometrical  constraints  in  terms  of  local  clique  potentials  in 
an  MRF  that  has  general  connectivity.  The  stochastic  esti¬ 
mation  methods  are  computationally  very  expensive.  We  are 
now  designing  a  deterministic  estimation  algorithm  that 
incrementally improves its estimation as new evidence arrives.
We  believe  this  method  of  information  fusion  can  be  applied 
to  problems  other  than  image  segmentation  as  well. 


Acknowledgments

We  would  like  to  thank  Dave  Sher  for  providing  his 
edge  detectors  and  many  valuable  suggestions.  We  also 
thank  Rajeev  Raman  for  his  help  in  writing  the  code  for  the 
experiments. 

This  work  was  supported  by  the  Air  Force  Systems 
Command,  Rome  Air  Development  Center,  Griffiss  Air  Force 
Base,  New  York  13441-5700,  and  the  Air  Force  Office  of 
Scientific  Research,  Bolling  AFB,  DC  20332  under  Contract 
No.  F30602-85-C-0008.  This  contract  supports  the  Northeast 
Artificial Intelligence Consortium (NAIC). This work was
also  supported  by  the  U.S.  Army  Engineering  Topographic 
Laboratories  under  Contract  No.  DACA76-85-C-hr>ni 
References: 

Barrow, H. G., and J. M. Tenenbaum, Recovering intrinsic
scene characteristics from images, in Computer Vision
Systems, A. R. Hanson and E. M. Riseman (eds.), Academic
Press, 1978.

Bolle,  R.  M.  and  D.  B.  Cooper,  Bayesian  Recognition  of  Local 
3-D  Shape  by  Approximating  Image  Intensity  Functions  with 
Quadric  Polynomials,  IEEE  Transactions  on  Pattern  Analysis 
and  Machine  Intelligence,  Vol.  PAMI-6,  No.  4,  pp.  418-429, 
July  1984. 

Bolles, Robert C., Verification Vision for Programmable
Assembly, Proceedings: IJCAI, 1977.

Cross, G. R., and A. K. Jain, Markov Random Field Texture
Models, IEEE Trans. on Pattern Analysis and Machine
Intelligence, PAMI-5, 25-39, 1983.

Derin, H., and W. S. Cole, Segmentation of Textured Images
Using Gibbs Random Fields, Computer Vision, Graphics, and
Image Processing 35, 72-98, 1986.

Elliott,  H.,  and  H.  Derin,  Modelling  and  Segmentation  of 
Noisy  and  Textured  Images  Using  Gibbs  Random  Fields, 
Technical  Report  #ECE-UMASS-SE84  1,  University  of 
Massachusetts,  1984. 

Feldman,  J.  A.,  and  Y.  Yakimovsky,  Decision  Theory  and 
Artificial  Intelligence:  I.  A  Semantics-Based  Region  Analyzer, 
Artificial  Intelligence  5,  349-371, 1974. 

Geman, Stuart and Donald Geman, Stochastic Relaxation,
Gibbs Distributions, and the Bayesian Restoration of Images,
IEEE Trans. on Pattern Analysis and Machine Intelligence,
PAMI-6, No. 6, 1984.

Grimson, W. E. L., From Images to Surfaces: A Computational
Study of the Human Early Visual System, Cambridge, MA, MIT
Press, 1981.

Hassner, Martin and Jack Sklansky, The Use of Markov
Random Fields as Models of Texture, in Image Modeling,
Azriel Rosenfeld (ed.), Academic Press, Inc., 1980, 185-198.

Hummel, R., and S. Zucker, On the Foundations of Relaxation
Labeling Processes, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. PAMI-5, 1983, 267-286.

Ikeuchi, K., and B. K. P. Horn, Numerical shape from shading
and occluding boundaries, Artificial Intelligence, 15, 141-184,
1981.

Kindermann,  Ross  and  J.  Laurie  Snell,  Markov  Random 
Fields  and  their  Applications,  American  Mathematical 
Society,  1980. 




Marr,  David,  Vision,  San  Francisco:  W.  H.  Freeman,  1982. 

Marroquin,  Jose  L.,  Probabilistic  Solution  of  Inverse 
Problems,  Technical  Report  860,  MIT  Artificial  Intelligence 
Laboratory,  1985. 

Pearl, Judea, On Evidential Reasoning in a Hierarchy of
Hypotheses, Artificial Intelligence, 16, No. 2, Feb 1986.

Rosenfeld, A., R. A. Hummel, and S. W. Zucker, Scene
labelling by relaxation operations, IEEE Trans. SMC 6, 1976,
420-433.

Shafer,  G.,  A  Mathematical  Theory  of  Evidence,  Princeton 
University  Press,  1976. 

Sher,  David,  Optimal  Likelihood  Detectors  for  Boundary 
Detection  Under  Gaussian  Additive  Noise,  Proceedings:  IEEE 
Conference  on  Computer  Vision  and  Pattern  Recognition, 
Miami,  Florida,  June  1986. 

Sher,  David,  Advanced  Likelihood  Generators  for  Boundary 
Detection,  Technical  Report  197,  Computer  Science 
Department,  University  of  Rochester,  1987. 

Shvaytser, H., and Shmuel Peleg, A New Approach to the
Consistent Labeling Problem, Proceedings: IEEE Conference
on Computer Vision and Pattern Recognition, 85, 320-327.

Terzopoulos, D., Integrating Visual Information from Multiple
Sources, in From Pixels to Predicates, Alex P. Pentland (ed.),
Ablex Publishing Co., 1986, 111-142.

Witkin, Andrew, Recovering Surface Shape and Orientation
from Texture, Artificial Intelligence 17, 1981, 17-45.



MINIMIZATION  OF  THE  QUANTIZATION  ERROR  IN 
CAMERA  CALIBRATION 

Behrooz  Kamgar-Parsi 


Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742 


ABSTRACT 

It  is  shown  that  the  quantization  error  due  to  the  lim¬ 
ited  resolution  capabilities  of  existing  cameras  is  so  serious 
that  the  calibration  of  two  camera  stereo  in  an  uncontrolled 
environment  is  frequently  unreliable.  This  is  done  by  de¬ 
riving  explicit  analytical  solutions  for  the  relative  pan  and 
tilt  angles  in  terms  of  the  world  pan  angle  (often  referred 
to  as  gaze  angle)  and  the  coordinates  of  the  image  points 
used  in  their  computation  (under  the  assumption  that  an 
estimate  of  these  angles  is  available).  The  degree  of  the  im¬ 
pact  of  the  quantization  error,  however,  depends  highly  on 
the  image  points  that  are  used  for  the  computation  of  the 
camera  orientations.  That  is,  to  obtain  reliable  results  the 
image  points  must  be  discriminated  on  the  basis  of  their 
coordinates  and,  at  times,  their  horizontal  disparities.  Not 
only  this,  but  also  the  combination  of  image  points  used  for 
the  computation  of  a  given  angle  must  form  special  config¬ 
urations.  These  findings,  obtained  by  deriving  analytical 
expressions  for  the  magnitude  of  the  error  in  the  computed 
values  of  the  orientation  angles,  indicate  that  points  and 
combinations  of  points  used  for  calibration  must  be  selected 
intelligently.  A  scheme  is  presented  to  choose  (from  the 
available  image  points)  the  appropriate  points  and  combi¬ 
nations  of  points  that  give  rise  to  the  most  accurate  solu¬ 
tion. 

1.  INTRODUCTION 

Tasks  such  as  scene  analysis,  eye-hand  coordination  and 
navigation  require  depth  information.  A  standard  way  of 
obtaining depth information is to extract it from stereo image
pairs. For example, this technique can be used on an
autonomous  vehicle  which  would  “look”  at  the  road  using 
two  cameras  and  create  a  stereo  pair  of  images.  Depth  in¬ 
formation  would  then  be  used  to  build  a  3-D  representation 
of  the  world  facing  the  vehicle.  Utilizing  this  3-D  represen¬ 
tation,  the  vehicle  would  be  able  to  plan  ahead  and  follow 
the  road. 

The  accuracy  of  the  3-D  representation,  however,  de¬ 
pends  on  the  accuracy  with  which  the  orientations  of  the 
two  cameras  are  known.  Although  for  nearby  objects  small 
inaccuracies  in  camera  orientations  may  not  be  of  much 


significance,  for  objects  that  are  not  close  to  the  stereo  sys¬ 
tem  the  situation  is  quite  different.  In  fact,  in  the  presence 
of  inaccuracies  in  camera  orientations,  the  accuracy  of  the 
3-D  representation  can  deteriorate  drastically  with  depth. 
This,  in  turn,  will  curtail  the  capabilities  of  the  vehicle  to 
plan  ahead  or  to  plan  correctly. 

Some  investigators  have  addressed  the  topic  of  stereo 
error  due  to  quantization  but  they  have  not  discussed  its 
impact  on  the  problem  of  camera  calibration  [1],  [2].  Other 
investigators  have  discussed  the  question  of  camera  cali¬ 
bration  without  considering  the  effect  of  quantization  error 
thereby  limiting  the  practicality  of  their  approach  [3],  [4]. 
In  addition  to  stereo  error  analysis,  we  have  discussed  prac¬ 
tical  difficulties  in  camera  calibration  and  ways  to  handle 
them. 

It  is  shown  [5]  that,  in  stereo,  precise  knowledge  of  the 
relative  pan  angle  of  the  two  cameras  with  respect  to  each 
other  (and  to  a  lesser  degree  the  relative  tilt  angle)  is  cru¬ 
cial  to  the  accurate  3-D  recovery  of  object  points  in  space, 
whereas  accurate  knowledge  of  the  pan  and  tilt  angles  rel¬ 
ative  to  the  scene,  i.e.  the  “world”  angles,  is  of  less  signif¬ 
icance.  Therefore,  the  most  important  task  would  be  the 
computation  of  the  relative  pan  and  tilt  angles.  If  possible 
the  world  pan  angle  should  also  be  computed.  The  world 
tilt  angle  cannot  be  computed. 

It  is  assumed  that  the  stereo  system  consists  of  two  iden¬ 
tical  and  synchronized  cameras  that  can  pan  and  tilt  inde¬ 
pendently,  but  are  installed  at  a  fixed  distance  from  one 
another.  The  fixed  line  segment  defined  by  the  centers  of 
rotation  of  the  two  cameras  is  called  the  baseline,  and  the 
world  coordinate  system  is  defined  as  follows:  Take  the 
center  of  the  coordinate  system  to  be  the  point  midway 
between  the  centers  of  rotation  of  the  two  cameras,  the 
X-axis  to  point  from  the  right  camera  to  the  left  camera, 
the  Y-axis  to  be  vertically  upward  and  the  Z-axis  straight 
ahead.  Similar  axis  orientations  are  observed  for  coordi¬ 
nate  systems  attached  to  the  right  and  the  left  cameras. 
The  center  of  the  coordinate  system  for  either  camera  will 
be  its  center  of  rotation. 

Let $\theta$ be the pan (counterclockwise Y rotation) and $\phi$
the tilt (counterclockwise X rotation) of the left camera
coordinate system with respect to the right camera, and
let $\alpha$ be the counterclockwise pan angle of the right camera
with respect to the world coordinate system.

This diagram shows the orientation of the X and Z axes for the
world, the right and the left coordinate systems, whose centers
are points W, R, and L, respectively. The angle between
$X_w$ and $X_r$ is $\alpha$, and the angle between $X_w$ and $X_l$ is $\alpha + \theta$.

Figure 1

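If $(X_r, Y_r, Z_r)$ are the coordinates of a given point in
space with respect to the right camera coordinate system
and $(X_l, Y_l, Z_l)$ are those with respect to the left coordinate
system, we have

$$\begin{pmatrix} X_r \\ Y_r \\ Z_r \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta\sin\phi & \sin\theta\cos\phi \\ 0 & \cos\phi & -\sin\phi \\ -\sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi \end{pmatrix} \begin{pmatrix} X_l \\ Y_l \\ Z_l \end{pmatrix} + \begin{pmatrix} b\cos\alpha \\ 0 \\ b\sin\alpha \end{pmatrix} \qquad (1)$$

where b is the length of the baseline.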
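
As a rough illustration, equation (1) can be coded as a pan about Y composed with a tilt about X, followed by the baseline offset $(b\cos\alpha, 0, b\sin\alpha)$. The function name `left_to_right` and the sample numbers are assumptions of this sketch; angles are in radians.

```python
import math


def left_to_right(p_left, theta, phi, alpha, b):
    """Map (X_l, Y_l, Z_l) to (X_r, Y_r, Z_r) following equation (1)."""
    X, Y, Z = p_left
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    X_r = ct * X + st * sp * Y + st * cp * Z + b * math.cos(alpha)
    Y_r = cp * Y - sp * Z
    Z_r = -st * X + ct * sp * Y + ct * cp * Z + b * math.sin(alpha)
    return (X_r, Y_r, Z_r)


# With no relative rotation and alpha = 0, the point is shifted by b along X.
aligned = left_to_right((1.0, 2.0, 10.0), 0.0, 0.0, 0.0, 0.5)

# With b = 0 the map is a pure rotation, so distances from the origin are kept.
rotated = left_to_right((1.0, 2.0, 10.0), 0.02, 0.01, 0.3, 0.0)
```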

It is obvious that there is no advantage to allowing the
difference in the tilt angles of the two cameras, i.e. $\phi$, to be
large. Also, we assume that an estimate, $\theta_0$, of $\theta$ is available.
That is, we have $\theta = \theta_0 + d\theta$, where $d\theta$ is a small angle.

It is shown that precise knowledge of $\theta$ (and to some
degree $\phi$) is crucial to the accurate 3-D recovery of object
points in space, whereas accurate knowledge of $\alpha$ is of less
importance. On the other hand, as we shall see, compared
to $\theta$ and $\phi$, the numerical computation of $\alpha$ is more likely
to produce an erroneous result. Therefore, although the
objective is to compute all three angles, we have shown
that it is often possible to bypass the calculation of $\alpha$ and
to compute $\theta$ and $\phi$ directly. This is despite the fact that
in the analytic formulation of the problem the three angles
are intertwined. It has to be noted that, in general, we have
assumed that an estimate of the angle $\alpha$ is available.

2.  ANALYTICAL  EXPRESSIONS  FOR
RELATIVE  PAN  AND  TILT  ANGLES 

After expanding the sine and the cosine functions (aside
from those for $\alpha$) in equation (1) and ignoring terms of the
second order in $d\theta$ and $\phi$, that equation may be written as

$$x_r A_r = (x_l\cos\theta_0 - f\sin\theta_0)A_l - (f\cos\theta_0 + x_l\sin\theta_0)A_l\,d\theta + y_l A_l\sin\theta_0\,\phi + b\cos\alpha \qquad (2)$$

$$y_r A_r = y_l A_l + f A_l\,\phi \qquad (3)$$

$$-f A_r = -(x_l\cos\theta_0 - f\sin\theta_0)A_l\,d\theta + y_l A_l\cos\theta_0\,\phi - (f\cos\theta_0 + x_l\sin\theta_0)A_l + b\sin\alpha \qquad (4)$$

where f is the focal length of either camera, and $(x_r, y_r)$
and $(x_l, y_l)$ are the right and the left projected coordinates
of a given point in space. That is, $x_r = -fX_r/Z_r$ and
$y_r = -fY_r/Z_r$; $x_l$ and $y_l$ are defined similarly. $A_r$ and $A_l$
are defined as $-Z_r/f$ and $-Z_l/f$, respectively.

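Assuming, for the moment, that $\alpha$ is known, we have
three equations in four unknowns, i.e. $d\theta$, $\phi$, $A_r$ and $A_l$. We
can rewrite equations (2) and (3) as follows:

$$d\theta = \frac{(x_l\cos\theta_0 - f\sin\theta_0)A_l - x_r A_r + y_l A_l\sin\theta_0\,\phi + b\cos\alpha}{(f\cos\theta_0 + x_l\sin\theta_0)A_l} \qquad (5)$$

$$\phi = \frac{y_r A_r - y_l A_l}{f A_l} \qquad (6)$$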

Substitution of (5) and (6) into (4) will yield an expression
for $A_r$ in terms of $A_l$. This can in turn be substituted
into (5) and (6) to provide expressions for $d\theta$ and $\phi$ in terms
of $A_l$ only. Doing this we will have two equations in three
unknowns, i.e. $d\theta$, $\phi$ and $A_l$. Considering a second point
in space (with known right and left image positions), we
will have altogether four equations in the following four
unknowns: $d\theta$, $\phi$, $A_{1l}$ and $A_{2l}$. After considerable algebra
$A_{1l}$ and $A_{2l}$ are eliminated and the expressions for $d\theta$ and $\phi$
simplify to these:

$$d\theta = \frac{(1 - P_{12})\,(a_{1r}y_{1l} - a'_{1l}y_{1r})\,[(f^2 + y_{2r}y_{2l}\cos\theta_0)\cos\alpha + (f x_{2r} - y_{2r}y_{2l}\sin\theta_0)\sin\alpha]}{(1 - P_{12})\,y_{1r}b'_{1l}\,[f a_{2r} + y_{2r}y_{2l}\cos(\alpha + \theta_0)]} \qquad (7)$$

$$\phi = \frac{y_{1l}y_{2r}a_{1r}b'_{2l} - y_{2l}y_{1r}a_{2r}b'_{1l} + y_{1r}y_{2r}(c_{2l}d_{1l} - c_{1l}d_{2l})}{y_{1r}b'_{1l}[f a_{2r} + y_{2r}y_{2l}\cos(\alpha + \theta_0)] - y_{2r}b'_{2l}[f a_{1r} + y_{1r}y_{1l}\cos(\alpha + \theta_0)]} \qquad (8)$$

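where $P_{12}$ is a permutation operator (switching the indices
1 and 2 in the expression to which it is applied), and where

$$a_{ir} = f\cos\alpha + x_{ir}\sin\alpha \qquad (9)$$

$$a_{il} = f\cos\alpha + x_{il}\sin\alpha \qquad (10)$$

$$b_{il} = x_{il}\cos\alpha - f\sin\alpha \qquad (11)$$

$$c_{il} = f\cos\theta_0 + x_{il}\sin\theta_0 \qquad (12)$$

$$d_{il} = x_{il}\cos\theta_0 - f\sin\theta_0 \qquad (13)$$

$$a'_{il} = c_{il}\cos\alpha + d_{il}\sin\alpha \qquad (14)$$

$$b'_{il} = d_{il}\cos\alpha - c_{il}\sin\alpha \qquad (15)$$

Also the following definitions will be useful:

$$\gamma_i = x_{ir}y_{il} - x_{il}y_{ir} \qquad (16)$$

$$\delta_{iy} = y_{ir} - y_{il} \qquad (17)$$

$$\rho_i = f^2 + y_{ir}y_{il} \qquad (18)$$

$$\gamma'_i = x_{ir}y_{il} - d_{il}y_{ir} \qquad (19)$$

$$\rho'_i = f^2 + y_{ir}y_{il}\cos\theta_0 \qquad (20)$$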

Numerical computation of $d\theta$ and $\phi$ is composed of the
calculation of several intermediate quantities (such as $\gamma$, $\rho$,
etc.). In order to discuss the computational difficulties, we
need to examine the magnitudes of these quantities separately.
Presently these quantities are expressed in terms of
$x_r$, $x_l$, $y_r$ and $y_l$ (of a given point in space), which makes
it difficult to estimate their magnitudes. This is because
for a point in space not all x and y coordinates of both
left and right image positions can be considered as independent
quantities. For our stereo system, in particular,
$y_r$ and $y_l$ are tightly correlated. Therefore in Appendix A
we have eliminated $y_l$ and have derived expressions for the
intermediate quantities in terms of $d\theta$, $\phi$, $x_r$, $x_l$ and $y_r$.

3.  COMPUTATION  OF  RELATIVE  PAN 
ANGLE  WHEN  WORLD  PAN  ANGLE  IS  ZERO 

To emphasize the computational difficulties, first we focus
on the computation of $d\theta$ for the less complicated case
of $\theta_0 = \alpha = 0$. Necessary modifications for the case of
arbitrary values of $\theta_0$ and $\alpha$ will be discussed later.

Setting $\alpha$ (and $\theta_0$) to zero, equation (7) reduces to

$$d\theta = \frac{f\,[(y_{2r} - y_{2l})(f^2 + y_{1r}y_{1l}) - (y_{1r} - y_{1l})(f^2 + y_{2r}y_{2l})]}{x_{1l}y_{1r}(f^2 + y_{2r}y_{2l}) - x_{2l}y_{2r}(f^2 + y_{1r}y_{1l})} \qquad (21)$$

If we were dealing with a system where x's and y's could
be obtained with high precision, then, using the above expression,
any two points in space (which did not happen
to give rise to a vanishing denominator) could give us an
accurate value for $d\theta$.
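
The sensitivity of this two-point formula to pixel rounding can be explored with a small synthetic experiment in Python. The sketch below assumes $\theta_0 = \alpha = 0$, uses (21) in the form that follows from the linearized equations (2)-(4), and invents every number in it (focal length in pixels, true angles, baseline, point positions); the helper names are hypothetical.

```python
import math

F = 549.0  # assumed focal length in pixels


def back_project(x, y, depth):
    """Left-camera point whose image is (x, y), using x = -F X / Z, y = -F Y / Z."""
    return (-x * depth / F, -y * depth / F, depth)


def project(p):
    return (-F * p[0] / p[2], -F * p[1] / p[2])


def left_to_right(p, theta, phi, b):
    """Equation (1) with alpha = 0."""
    X, Y, Z = p
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    return (ct * X + st * sp * Y + st * cp * Z + b,
            cp * Y - sp * Z,
            -st * X + ct * sp * Y + ct * cp * Z)


def observe(p, theta, phi, b, quantize):
    """Return (x_l, y_l, y_r), optionally rounded to whole pixels."""
    x_l, y_l = project(p)
    y_r = project(left_to_right(p, theta, phi, b))[1]
    values = (x_l, y_l, y_r)
    return tuple(float(round(v)) for v in values) if quantize else values


def d_theta(pt1, pt2):
    """Equation (21); each point is given as (x_l, y_l, y_r)."""
    x1, y1l, y1r = pt1
    x2, y2l, y2r = pt2
    p1 = F * F + y1r * y1l
    p2 = F * F + y2r * y2l
    num = F * ((y2r - y2l) * p1 - (y1r - y1l) * p2)
    den = x1 * y1r * p2 - x2 * y2r * p1
    return num / den


TRUE_THETA, TRUE_PHI, BASELINE = 0.01, 0.005, 0.5   # radians, radians, metres
points = [back_project(200.0, 200.0, 10.0), back_project(-200.0, 150.0, 25.0)]

exact = d_theta(*(observe(p, TRUE_THETA, TRUE_PHI, BASELINE, False) for p in points))
rounded = d_theta(*(observe(p, TRUE_THETA, TRUE_PHI, BASELINE, True) for p in points))
```

With unrounded coordinates the estimate `exact` stays close to the true angle (the residual comes from the first-order linearization), whereas `rounded`, computed from whole-pixel coordinates, is expected to be noticeably further off for this pair of points.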

However, x's and y's can, at best, be obtained to the
precision of a pixel. This makes the numerical computation
of the angle $d\theta$ a difficult task. Precautions have to
be taken to ensure that all the mathematical operations
required during the computation of $d\theta$ produce acceptable
values. The only operations involved are additions, subtractions,
multiplications and divisions. As is the case in
most numerical computations involving these four operators,
the two that are most likely to intensify errors are
subtractions and divisions. Divisions are prone to producing
unacceptable results when their denominators are small.
Subtractions, however, may lead to inaccurate results when
their operands are almost equal.

Computation of $d\theta$ requires one division and several
subtractions. To know whether the computation will be reliable or
not, one needs to examine the magnitudes of the operands of each of
these operators. Among the subtractions, two are those that define
vertical disparities of the points in space. We must therefore find
out how reliable the computation of vertical disparity is. Assuming
that correct matches have been found, $y_r$ and $y_l$ can still be
erroneous up to half a pixel. This implies that vertical disparity
can be erroneous up to a full pixel. To know how significant this
error is we need to have an estimate of the magnitude of the vertical
disparity. For this, in equation (A7), we set $\alpha$ to zero. To
the first order in $d\theta$ and $\phi$ that equation may be written
as


$$y_r - y_l \approx (f^2 + y_l^2)\,\phi/f - x_l y_r\,d\theta/f \tag{22}$$

Note that if we start from equation (A6), then instead of $y_l$ on
the right hand side of equation (22) we will have $y_r$. The rest of
the expression will be the same. Now, if $\phi$ is negligible, then


$$\delta y \approx -(x_l y_r/f)\,d\theta \tag{23}$$

To see how small this quantity is we write $d\theta$ in terms of
degrees and $f$ in terms of $n$ and $v$, where $n$ is the number of
pixels on every row (or column) of the image plane and $v$ the view
angle of the camera:


$$\delta y \approx -\frac{x_l y_r}{(n/2)\cot(v/2)}\left(\frac{\pi}{180}\,d\theta\right) \tag{24}$$

$y_r$, $y_l$ and $x_l$ are now in (units of) pixels.


For a typical camera we have roughly $n = 512$ and $v = 50$ degrees.
For the largest possible value that $x_l$ and $y_r$ can have, i.e.
256 pixels, the vertical disparity will be
$|y_r - y_l| \approx 2.085\,d\theta$ pixels, with $d\theta$ in
degrees. For $d\theta = .5$ degrees, the vertical disparity will,
therefore, be about 1 pixel, and that is within the range of error.
This may suggest that, to increase the accuracy of the computation of
$\theta$, one should attempt to enhance the magnitude of the vertical
disparity by increasing the vergence angle or by keeping the other
term in equation (22). The latter is possible by intentionally
keeping the two cameras at different tilts. Such strategies, however,
are not significantly helpful. This is because in the numerator of
(21) the two expressions $f^2 + y_{1r}y_{1l}$ and $f^2 + y_{2r}y_{2l}$
are roughly of equal magnitudes. Therefore, the middle subtraction in
the numerator of equation (21) is basically a subtraction of two
vertical disparities, implying that any attempt to increase vertical
disparities would leave the difference between the two terms
$(y_{2r} - y_{2l})(f^2 + y_{1r}y_{1l})$ and
$(y_{1r} - y_{1l})(f^2 + y_{2r}y_{2l})$ relatively unchanged.
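The magnitude estimate above can be reproduced with a few lines of code. This is only a sketch: the function names are hypothetical, the tilt term is neglected as in equation (23), and the focal length in pixels is taken as $(n/2)\cot(v/2)$.

```python
import math


def focal_length_pixels(n, view_angle_deg):
    """Focal length in pixels for an image n pixels wide and the given view angle."""
    return (n / 2.0) / math.tan(math.radians(view_angle_deg) / 2.0)


def vertical_disparity_pixels(x_l, y_r, dtheta_deg, n=512, view_angle_deg=50.0):
    """Magnitude of the first-order vertical disparity x_l * y_r * dtheta / f,
    with dtheta given in degrees and the tilt term neglected."""
    f = focal_length_pixels(n, view_angle_deg)
    return abs(x_l * y_r * math.radians(dtheta_deg)) / f


f_pixels = focal_length_pixels(512, 50.0)                    # about 549 pixels
per_degree = vertical_disparity_pixels(256, 256, 1.0)        # about 2.08 pixels
at_half_degree = vertical_disparity_pixels(256, 256, 0.5)    # about 1 pixel
```

Even at an image corner, half a degree of relative pan therefore produces a vertical disparity comparable to the one-pixel uncertainty of the measurement.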

We now discuss the effect of the other subtractions. In the numerator
of equation (21), aside from the two subtractions that define
vertical disparities of the two points in space, there is only one
more subtraction. It should be noted, however, that there is a hidden
subtraction which does not appear in (21) and yet requires attention.
Rearranging terms in the numerator of equation (21), we can write


$$d\theta = \frac{f\left[(y_{1l} - y_{2l})(f^2 + y_{1r}y_{2r}) - (y_{1r} - y_{2r})(f^2 + y_{1l}y_{2l})\right]}{x_{1l}y_{1r}(f^2 + y_{2r}y_{2l}) - x_{2l}y_{2r}(f^2 + y_{1r}y_{1l})} \tag{25}$$

Here we note that a successful computation of $d\theta$ requires
large values of $y_{2r} - y_{1r}$ (and also $y_{2l} - y_{1l}$).

This issue is easy to resolve: points that are paired for computation
of $d\theta$ should be vertically far apart. This implies that we
must not, in an attempt to improve our estimation of $d\theta$, pair
every point with known right and left image positions with all others
(and then average). It is much better to average over relatively few
well paired points.




We now go back to equation (21) and examine the remaining subtraction
in the numerator, i.e. the middle one. Unfortunately, the operands of
this subtraction are almost equal, causing the subtraction to easily
produce erroneous results. To minimize the adverse effects of this
subtraction, we need to make the two operands as different as we can
by choosing appropriate points and pairs of points. To find out where
these points are located, we use equation (A10), which does not
involve $y_l$. After setting $\alpha = 0$, we find the analytical
result of the subtraction in question, num (which is essentially the
numerator in the expression for $d\theta$), to be the following:

$$\text{num} = \left[x_{1l}y_{1r}(f^2 + y_{2r}^2) - x_{2l}y_{2r}(f^2 + y_{1r}^2)\right]d\theta/f \tag{26}$$

Ideally we want to maximize num by choosing appropriate values for
the four quantities $x_{1l}$, $x_{2l}$, $y_{1r}$ and $y_{2r}$. The
only constraint is that $y_{1r}$ and $y_{2r}$ must be considerably
different. With this in mind one can easily verify that num can be
maximized by setting $x_{1l} = x_{2l} = a$, and
$y_{1r} = -y_{2r} = a$, where $2a$ is the length (or the width) of
the image plane. Practically speaking this means that on the left
image we should look for image points near the four corners (this is
because $y_l$ is not too different from $y_r$), and calculate
vertical disparity by finding their matches on the right image. We
should then pair points near the top right corner of the left image
with points near the bottom right corner of that image. Likewise, we
pair image points near the top left corner with points near the
bottom left corner. It is easy to verify that so long as we choose
such pairs of points the denominator in the expression for $d\theta$
cannot be small, implying that the division involved in the
computation of $d\theta$ will be basically an error-free operation.

Now that we have decided what image points we want to use, it is
possible to make a rough approximation of the error we would incur in
the computation of $\theta$. This is done in Appendix B. The result
is that on the average, while using the prescribed points, the error
is .12 degrees. To know how serious the quantization error can be if
points close to the middle of the image are used, we note that one
can derive the following expression for the error $\Delta_\theta$ in
$\theta$:

$$\Delta_\theta = \frac{f(\Delta_{2y} - \Delta_{1y})}{2a^2}$$

where $\Delta_{1y}$ and $\Delta_{2y}$ are the errors in the vertical
disparities of points 1 and 2.

From the above expression the maximum error (for corner points) is
found to be $f/a^2 \approx .49$ degrees. Now if we use the same point
configuration but pick the points closer to the center of the image,
then $a$ in the above formulas will be smaller, indicating that the
quantization error is larger. For example, if we choose the points
that are located half way between the corners and the center of the
image, then $a$ will be half what it is for the corner points, and
both the average error and the maximum error will be four times
larger.
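The error bound for corner points, and its growth as the points move toward the image center, can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, the camera values ($n = 512$, $v = 50$ degrees) are the typical ones quoted earlier, and each vertical disparity is assumed to be wrong by at most one pixel.

```python
import math


def focal_length_pixels(n, view_angle_deg):
    """Focal length in pixels, f = (n/2) cot(v/2)."""
    return (n / 2.0) / math.tan(math.radians(view_angle_deg) / 2.0)


def max_pan_error_deg(f, a, disparity_error=1.0):
    """Worst case of f * (D2 - D1) / (2 a^2), in degrees, when the two
    disparity errors D1 and D2 have opposite signs and full magnitude."""
    worst_difference = 2.0 * disparity_error
    return math.degrees(f * worst_difference / (2.0 * a * a))


f_pixels = focal_length_pixels(512, 50.0)
corner_error = max_pan_error_deg(f_pixels, 256.0)    # points at the corners
halfway_error = max_pan_error_deg(f_pixels, 128.0)   # points half way to the center
```

With these values the corner bound comes out at roughly half a degree, in line with the figure quoted in the text, and halving $a$ multiplies it by four.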


It should be noted that in the absence of image points from corners
with the same latitude, image points from corners having the same
altitude may be paired provided that their $y$'s are sufficiently
different.

3.1. Modifications for Large Values of Vergence Angle

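If we set $\alpha = 0$, equation (7) reduces to

$$d\theta = \frac{(c_{2l}y_{2r} - fy_{2l})(f^2 + y_{1r}y_{1l}\cos\theta_0) - (c_{1l}y_{1r} - fy_{1l})(f^2 + y_{2r}y_{2l}\cos\theta_0)}{d_{1l}y_{1r}(f^2 + y_{2r}y_{2l}\cos\theta_0) - d_{2l}y_{2r}(f^2 + y_{1r}y_{1l}\cos\theta_0)} \tag{27}$$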

In analogy with the case $\theta_0 = \alpha = 0$, we conclude that
the best pairs of points (for the case when $\theta_0$ is not zero)
are those for which $y_{1l} \approx -y_{2l} \approx a$ and
$d_{1l} \approx d_{2l}$ has its maximum magnitude. Since
$d_l = x_l\cos\theta_0 - f\sin\theta_0$, when $\theta_0 > 0$ the two
corners that yield $x_l = -a$ are the best neighborhoods from which
to choose the two points. Although for small values of $\theta_0$ the
other two corners (yielding $x_l = a$) are almost as good, for large
values of $\theta_0$ these corners should be avoided. This is because
the value of $d_l$ can be too small in these corners. For example, if
$\tan\theta_0 \approx a/f$ then for these corners we have
$d_l \approx 0$. In these situations points that are horizontally
near the middle of the image ($x_l \approx 0$) are preferred to
corners with $x_l \approx a$.
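The behavior of $d_l = x_l\cos\theta_0 - f\sin\theta_0$ at the different image positions can be tabulated directly. The following sketch is only illustrative; the function name is hypothetical and the focal length is chosen (as an assumption) so that $\tan\theta_0 = a/f$ holds exactly at 25 degrees, close to the typical camera discussed earlier.

```python
import math


def d_left(x_l, theta0_deg, f):
    """d_l = x_l cos(theta0) - f sin(theta0), as defined in the text."""
    theta0 = math.radians(theta0_deg)
    return x_l * math.cos(theta0) - f * math.sin(theta0)


a = 256.0
f = a / math.tan(math.radians(25.0))   # so that tan(theta0) = a / f at 25 degrees

d_good_corner = d_left(-a, 25.0, f)    # x_l = -a: largest magnitude
d_middle = d_left(0.0, 25.0, f)        # x_l = 0: intermediate magnitude
d_bad_corner = d_left(a, 25.0, f)      # x_l = +a: vanishes at this vergence angle
```

For this vergence angle the corner with $x_l = a$ gives a vanishing $d_l$, while the middle of the image still gives half the magnitude available at $x_l = -a$.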

Employing the mathematical tools developed by Kamgar-Parsi and
Kamgar-Parsi [6] one can show that when points from the appropriate
corners are used, the average value of the digitization error in
$\theta$ is given by

ZpP  +  *:•  +  o1/1  sin1 90  ZHt*  +  a/«in>0)T  +  (**  -  a / sinflp)7  -  2fc[<] 
12aPk(acosto  +/sin0o)  7!o3/skft{acos  Sn  +  j  sin  0O)  sin3  $0 

where $k = f\cos\theta_0 - a\sin\theta_0$.

The above expression is good up to $\theta_0 \approx 30$ degrees. If
$\theta_0 \approx 30$ and the points that are used come from the bad
corners the quantization error will be very large. This is because
for these corners $a$ (in the above formula) must be replaced by
$-a$, causing the expression $a\cos\theta_0 + f\sin\theta_0$ in the
denominator to be small. Finally, one can verify that for the case
$\theta_0 = 0$ the error given in the above formula reduces to the
one derived in Appendix B.

3.2.  Modifications  for  Large  Values  of  World  Pan 
Angle 

When $\alpha$ is small the situation is more or less as it is when
$\alpha$ is zero. However, for large values of $\alpha$, terms
involving $\sin\alpha$ become significant, and, therefore, their
impact on the calculation of the intermediate quantities such as num
has to be investigated. From equation (A10) we see that the
dependence of num on $\alpha$ is through the $a_{ir}$'s, $a_{il}$'s
and $b_{il}$'s. However insensitive the $a_{ir}$'s and $a_{il}$'s may
be to the value of $\alpha$, the $b_{il}$'s, on the other hand, are
quite sensitive.

Assuming $\alpha > 0$, for a given negative value of $x_{il}$, the
magnitude of $b_{il}$ will increase as $\alpha$ increases, whereas
for a given positive value of $x_{il}$, the magnitude of $b_{il}$
declines (it can even vanish and then increase again) as $\alpha$
increases. The situation is reversed for $\alpha < 0$.

In other words, standing behind the cameras, if we pan them to the
left (right), then a pair of points from the right (left) corner of
the left image will give a more accurate value for $d\theta$ (as
compared to $\alpha = 0$), while those from the left (right) corner
will give a less accurate value. If $f\sin\alpha$ and
$x_{1l}\cos\alpha$ are roughly of the same magnitude but of opposite
signs, then points that horizontally are closer to the middle of the
image plane should be chosen instead.

4.  CALCULATION  OF  WORLD  PAN  ANGLE 

When the world pan angle, $\alpha$, is not known, then there will be
three unknowns, i.e. $d\theta$, $\phi$ and $\alpha$. To calculate
$\alpha$ we write equation (7) in the following form:

$$d\theta = \frac{A\cos(2\alpha) + B\sin(2\alpha) + C}{D\cos(2\alpha) + E\sin(2\alpha) + F} \tag{28}$$

where 

A  =  (1  —  Pu) [p'i {cjiyir  —  fyu)  —  l'i{Sx2r  —  yirVu  sin  0o)  ]  (29) 
B  =  ( l-Pu)WiPi  +  {c2iyir-fy2i)(fxir-yuyusinB0 )]  (30) 
C  =  ( 1  —  Pi j)  [p'i  (c jiyjr  —  /j/jj  )  4-  (/ii,  -  yt,yii  sin  tf0)  ]  (31) 
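$$D = (1 - P_{12})\left[y_{1r}d_{1l}p_2 + y_{1r}c_{1l}(fx_{2r} - y_{2r}y_{2l}\sin\theta_0)\right] \tag{32}$$

$$E = (1 - P_{12})\left[y_{1r}d_{1l}(fx_{2r} - y_{2r}y_{2l}\sin\theta_0) - y_{1r}c_{1l}p_2\right] \tag{33}$$

$$F = (1 - P_{12})\left[y_{1r}d_{1l}p_2 - y_{1r}c_{1l}(fx_{2r} - y_{2r}y_{2l}\sin\theta_0)\right] \tag{34}$$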

In principle, knowing the right and the left image positions of three
points in space will allow us to compute two distinct sets of six
coefficients in (28), i.e. $A_1, \ldots, F_1$, and
$A_2, \ldots, F_2$. The following equation can then be solved for
$\alpha$:

$$\frac{A_1\cos(2\alpha) + B_1\sin(2\alpha) + C_1}{D_1\cos(2\alpha) + E_1\sin(2\alpha) + F_1} = \frac{A_2\cos(2\alpha) + B_2\sin(2\alpha) + C_2}{D_2\cos(2\alpha) + E_2\sin(2\alpha) + F_2} \tag{35}$$

This approach, however, is not too practical. This is because the
equation we are dealing with is not linear and its solution may not
be stable. Since an estimate, $\alpha_0$, of the angle $\alpha$ is
generally available, we may assume


$$\alpha = \alpha_0 + d\alpha \tag{36}$$


where $d\alpha$ is a small angle. Therefore, we set our task to be
the calculation of $d\alpha$, and not $\alpha$. Since $d\alpha$ is a
small angle, equation (28) may be written as

$$d\theta = \frac{k + l\,d\alpha}{m + n\,d\alpha} \tag{37}$$

where

$$k = A\cos 2\alpha_0 + B\sin 2\alpha_0 + C \tag{38}$$

$$l = 2(B\cos 2\alpha_0 - A\sin 2\alpha_0) \tag{39}$$

$$m = D\cos 2\alpha_0 + E\sin 2\alpha_0 + F \tag{40}$$

$$n = 2(E\cos 2\alpha_0 - D\sin 2\alpha_0) \tag{41}$$

To calculate each set of $k, l, m, n$ we need two points. Two such
sets will then yield

$$d\alpha = \frac{k_2m_1 - k_1m_2}{(k_1n_2 - k_2n_1) + (l_1m_2 - l_2m_1)} \tag{42}$$
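The linearized solution for $d\alpha$ can be sketched as follows. This is an illustration rather than the authors' program: the function names are hypothetical, the coefficient sets are passed in as plain tuples $(A, B, C, D, E, F)$, and $l$ and $n$ are taken as the first-order coefficients of $d\alpha$ obtained by expanding $\cos 2\alpha$ and $\sin 2\alpha$ about $\alpha_0$.

```python
import math


def linearized_terms(coeffs, alpha0):
    """Return (k, l, m, n) for one coefficient set (A, B, C, D, E, F),
    expanding numerator and denominator to first order about alpha0."""
    A, B, C, D, E, F = coeffs
    c, s = math.cos(2.0 * alpha0), math.sin(2.0 * alpha0)
    k = A * c + B * s + C
    l = 2.0 * (B * c - A * s)
    m = D * c + E * s + F
    n = 2.0 * (E * c - D * s)
    return k, l, m, n


def solve_dalpha(coeffs1, coeffs2, alpha0, tiny=1e-12):
    """First-order correction d_alpha that makes the two fractions equal.

    Angles are in radians. Raises ZeroDivisionError when the two sets are
    too similar for the correction to be determined."""
    k1, l1, m1, n1 = linearized_terms(coeffs1, alpha0)
    k2, l2, m2, n2 = linearized_terms(coeffs2, alpha0)
    denominator = (k1 * n2 - k2 * n1) + (l1 * m2 - l2 * m1)
    if abs(denominator) < tiny:
        raise ZeroDivisionError("coefficient sets are not sufficiently different")
    return (k2 * m1 - k1 * m2) / denominator
```

The explicit check on the denominator reflects the requirement, discussed below, that the two coefficient sets be sufficiently different.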

Numerical computation of $d\alpha$ is highly sensitive to round-off
errors caused by the limited resolution capabilities of the camera.
Indeed, for reasons which will become clear shortly, reliable
computation of $\alpha$ is a more difficult task than that of
$\theta$. To find clues to the successful computation of $\alpha$, we
shall briefly review our approach. We calculate $\alpha$ (or
$d\alpha$) such that the two values of $d\theta$ come out equal (see
equation (35)). That is, once $\alpha$ is computed, not only must
equation (35) hold, but also each side of it has to be equal to the
true value of $d\theta$. It is therefore necessary to choose those
image points that, except for the error in $\alpha$, would produce
reliable values for $\theta$. This must be observed even if these
points are not used for the calculation of $\theta$. Hence, points
and pairs of points that are not qualified for the computation of
$\theta$ shall not qualify for the computation of $\alpha$ either.

We still need to make one further consideration. Unless the two sets
of six coefficients involved in equation (35) are sufficiently
different, the computation of $d\alpha$ is likely to be overwhelmed
by errors. Theoretically, we either need three points that when
paired two by two give sets of coefficients that are sufficiently
different, or two distinct pairs of points yielding adequately
different sets of coefficients. Practically, however, except for
unusual situations, the former is not possible (because the three
points must still constitute two pairs of points that are appropriate
for computation of $\theta$). To avoid confusion, from now on, unless
stated otherwise, we assume that the number of points required for a
single computation of $d\alpha$ is four.

To know where these four points can be found, we need to examine the
six coefficients, $A, B, C, D, E, F$, and to estimate their
magnitudes. To simplify this task here we concentrate on the case
$\theta_0 = 0$. Yet, the dependence of these coefficients on a large
number of parameters makes it virtually impossible to extract any
useful information from the formulas defining them. Nevertheless, we
can simplify our task by noting that any set of six coefficients that
is appropriate for computation of $d\alpha$ has to be formed by
pairing two points that are appropriate for computation of $\theta$,
i.e. two points whose projections on the left image plane lie near
two corners that have the same latitude. Therefore, for these two
points we can roughly assume that $x_{1l} \approx x_{2l}$,
$y_{1l} \approx -y_{2l}$. Since vertical disparity is not large we
can also assume that $y_{1r} \approx -y_{2r}$.

Furthermore, we will focus our attention only on the coefficients in
the denominator, i.e. $D$, $E$, and $F$. We will not examine $A$,
$B$, and $C$ for two reasons: a) they are still complicated, and b)
in general if the $D$'s, $E$'s and $F$'s are sufficiently different
so will be the $A$'s, $B$'s and $C$'s (this can be seen from equation
(35), and also from the discussion in Appendix C). After replacing
$x_{1l}$ and $x_{2l}$ by $x_l$, $y_{1r}$ and $-y_{2r}$ by $y_r$, and
$p_1$ and $p_2$ by $p$, for the case $\theta_0 = 0$ we may write




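$$D \approx y_r\left[2px_l + f^2(x_{1r} + x_{2r})\right] \tag{43}$$

$$E \approx fy_r\left[-2p + x_l(x_{1r} + x_{2r})\right] \tag{44}$$

$$F \approx y_r\left[2px_l - f^2(x_{1r} + x_{2r})\right] \tag{45}$$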

From the above equations we can see that by choosing $x_{1r}$ and
$x_{2r}$ appropriately, one can obtain two sets of $D$, $E$ and $F$
that are sufficiently different. We will distinguish two cases
depending on whether $x_l$ is close to $-a$ or $a$ (where $2a$ is the
length or the width of the image plane).

Case I ($x_l \approx -a$):

In this case the point in space has to be wide and deep. For,
otherwise, it would be outside the view of the right camera.
Therefore, the right image will also be close to the corner, implying
that $x_r$ will be close to $-a$ too (points 1 and 2 in Figure 2).
Therefore, the three coefficients will have the following values:

$$D \approx -2ay_r(p + f^2) \tag{46}$$

$$E \approx -2fy_r(p - a^2) \tag{47}$$

$$F \approx -2ay_r(p - f^2) \tag{48}$$

As the values of $D$, $E$ and $F$ are almost the same for every pair
of points having $x_l \approx -a$, we conclude that (for a single
computation of $\alpha$) we can allow only one pair of points with
$x_l \approx -a$ into the computation.

[Figure 2: point pairs (1,2) and (3,4)]

[Figure 3: left image plane and right image plane, point pair (5,6)]

[Figure 4: left image plane and right image plane, point pair (7,8)]

Case II ($x_l \approx a$):

In this case the point can be virtually anywhere in space, and,
therefore, $x_r$ can vary almost from $-a$ to $a$. If
$x_{1r} \approx x_{2r} \approx a$ (points 3 and 4 in Figure 2), then
it is easy to check that $D$ and $F$ will have the same magnitude as
those in (46) and (48) but with a different sign, and $E$ will be the
same as that in (47). For small values of $\alpha$, the term
$E\sin 2\alpha$ will be insignificant, and the set will be the
opposite (not really different) of that for the previous case.
Therefore, when $\alpha$ is small the two pairs of points in Figure 2
should not be combined.

If $x_{1r} \approx x_{2r} \approx -a$ (pair 5,6 in Figure 3), then

$$D \approx 2ay_r(p - f^2) \tag{49}$$

$$E \approx -2fy_r(p + a^2) \tag{50}$$

$$F \approx 2ay_r(p + f^2) \tag{51}$$

This set is substantially different from the previous ones,
indicating that such a pair of points goes well with either the pair
(1,2) or the pair (3,4) in Figure 2. Also it is easy to verify that a
pair of points with $x_{1r} \approx -x_{2r}$ (pair 7,8 in Figure 4)
appears to go well with all three previous pairs. The concern,
however, is that pairs of points such as those illustrated in Figures
3 and 4 are relatively rare, and often we may not find any. Pairs of
points of the types (1,2) and (3,4) are more likely to be available,
but when $\alpha$ is small we cannot rely on them.

Generally speaking, on many occasions we may not be able to compute
$\alpha$ with a given degree of reliability. The question is what to
do when this happens. Previously it has been shown that the recovery
of points in space, while quite sensitive to error in $\theta$, is
relatively insensitive to error in $\alpha$. That is to say, it is
the dependence of $\theta$ on $\alpha$ which makes the knowledge of
the accurate value of $\alpha$ of special importance. The point,
however, is that the degree of this dependence depends on the image
points that we choose. An interesting question comes to mind: is it
possible to find a pair of points such that the computed value of
$\theta$ becomes sufficiently insensitive to small variations in the
value of $\alpha$? In general, for a single pair of points, the
answer is no. However, in the next section we will see that if we use
the two pairs of points shown in Figure 2, then the average of the
two computed values of $\theta$ will generally be insensitive to
small changes in $\alpha$, providing us with a reliable estimate of
$\theta$ despite the relative uncertainty in the value of $\alpha$.
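The approximate denominator coefficients for the corner configurations discussed in this section can be compared numerically. The sketch below evaluates equations (43) through (45) for three pairings; the function name is hypothetical, and the numerical values of $a$, $f$, $y_r$ and especially $p$ are placeholders chosen only for illustration.

```python
def denominator_coeffs(x_l, x_1r, x_2r, y_r, p, f):
    """Approximate D, E, F for the case theta0 = 0, equations (43)-(45)."""
    x_sum = x_1r + x_2r
    D = y_r * (2.0 * p * x_l + f * f * x_sum)
    E = f * y_r * (-2.0 * p + x_l * x_sum)
    F = y_r * (2.0 * p * x_l - f * f * x_sum)
    return D, E, F


# Placeholder values (assumptions, in pixel units).
a, f, y_r, p = 256.0, 549.0, 250.0, 1.0e5

set_12 = denominator_coeffs(-a, -a, -a, y_r, p, f)  # Case I, pair (1,2)
set_34 = denominator_coeffs(a, a, a, y_r, p, f)     # Case II, pair (3,4)
set_56 = denominator_coeffs(a, -a, -a, y_r, p, f)   # Case II, pair (5,6)
```

For pairs (1,2) and (3,4) the sets differ only in the signs of $D$ and $F$, which is why they cannot be combined when $\alpha$ is small, whereas the set for pair (5,6) has a different structure.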


5.  COMPUTATION  OF  RELATIVE  PAN 
ANGLE  WITHOUT  WORLD  PAN  ANGLE 

We would like to explore the possibility of computing $\theta$ when
the value of $\alpha$ is not known accurately. Suppose we have two
points that not only pair well (for computation of $d\theta$), but
also have their $x_r$'s roughly equal; namely, two points for which
we have the following relations:




$$x_{1l} \approx x_{2l} \tag{52}$$

$$x_{1r} \approx x_{2r} \tag{53}$$

$$y_{1r} \approx -y_{2r} \tag{54}$$

(Examples of such pairs are points (1,2) and (3,4) in Figure 2.)
Using equations (A11) and (A6) in Appendix A and approximating
$a_{1r}$ and $a_{2r}$ by $f$, the above relations allow us to write


Zuh-zul i  —  — Ti )  22  /Zr(S>,-Si,)  cot  a-2x,y,b,i  /  sin  a  (55) 

$$f(\delta_{2y} - \delta_{1y}) \approx -2y_r\delta_x\sin\alpha + 2y_rb_l\theta \tag{56}$$

where $x_{1r}$ and $x_{2r}$ are denoted by $x_r$, $y_{1r}$ and
$-y_{2r}$ by $y_r$, and $b_{1l}$ and $b_{2l}$ by $b_l$.

The above equation indicates that for negligible values of $\theta$
the difference in vertical disparities vanishes as $\delta_x$ does.
It must be emphasized that without the above mentioned assumptions
the expression for the difference of vertical disparities involves
terms that will not vanish as $\delta_x$ does (or $\delta_{1x}$ and
$\delta_{2x}$ do). This can be easily verified from equation (A6).

Employing equations (55) and (56) the coefficients $A$ through $F$
for the case $\theta_0 = 0$ may be written as

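$$A \approx -2y_r(p\sin\alpha + fx_r\cos\alpha)\,\delta_x + 2y_rb_l\left(p - fx_r\tan(\alpha/2)\right)\theta \tag{57}$$

$$B \approx -2y_r(fx_r\sin\alpha - p\cos\alpha)\,\delta_x + 2y_rb_l\left(fx_r + p\tan(\alpha/2)\right)\theta \tag{58}$$

$$C \approx -2y_r(p\sin\alpha - fx_r\cos\alpha)\,\delta_x + 2y_rb_l\left(p + fx_r\tan(\alpha/2)\right)\theta \tag{59}$$

$$D \approx 2y_r\left[p(x_r - \delta_x) + f^2x_r\right] \tag{60}$$

$$E \approx 2y_rf\left[x_r(x_r - \delta_x) - p\right] \tag{61}$$

$$F \approx 2y_r\left[p(x_r - \delta_x) - f^2x_r\right] \tag{62}$$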

where $p$ denotes $p_1 \approx p_2$.

It  is  worth  mentioning  that  numerical  experiments  have 
shown,  even  for  cases  when  the  relations  (52)  through  (54) 
are  only  approximately  valid,  that  the  above  expressions 
provide  fairly  good  approximations  to  the  exact  values  of 
the  six  coefficients. 

To evaluate the dependence of $\theta$ on $\alpha$, we differentiate
equation (28) with respect to $\alpha$ to obtain

$$\frac{d\theta}{d\alpha} = \frac{2\left[(BD - AE) + (CD - AF)\sin(2\alpha) + (BF - CE)\cos(2\alpha)\right]}{\left(D\cos(2\alpha) + E\sin(2\alpha) + F\right)^2} \tag{63}$$
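The closed form of the derivative can be checked against a finite difference. The sketch below is only a consistency check, not part of the original analysis; the function names and the sample coefficients are hypothetical, and the derivative is written with the overall factor of 2 that arises from differentiating $\cos 2\alpha$ and $\sin 2\alpha$.

```python
import math


def theta_of_alpha(coeffs, alpha):
    """Ratio (A cos 2a + B sin 2a + C) / (D cos 2a + E sin 2a + F)."""
    A, B, C, D, E, F = coeffs
    c, s = math.cos(2.0 * alpha), math.sin(2.0 * alpha)
    return (A * c + B * s + C) / (D * c + E * s + F)


def dtheta_dalpha(coeffs, alpha):
    """Analytic derivative of theta_of_alpha with respect to alpha (radians)."""
    A, B, C, D, E, F = coeffs
    c, s = math.cos(2.0 * alpha), math.sin(2.0 * alpha)
    numerator = (B * D - A * E) + (C * D - A * F) * s + (B * F - C * E) * c
    denominator = (D * c + E * s + F) ** 2
    return 2.0 * numerator / denominator


# Arbitrary sample coefficients with a denominator that stays away from zero.
sample = (0.3, -0.2, 0.5, 1.0, 0.4, 3.0)
step = 1e-6
finite_difference = (
    theta_of_alpha(sample, 0.2 + step) - theta_of_alpha(sample, 0.2 - step)
) / (2.0 * step)
analytic = dtheta_dalpha(sample, 0.2)
```

The numerator contains exactly the three groups of terms whose behavior is examined in the observations that follow.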

From equations (57) through (63) we can make the following
observations:

a) For negligible values of $\theta$, as $A$, $B$ and $C$ go to zero
as $\delta_x$ does, so will all the terms in the numerator of
$d\theta/d\alpha$. Therefore, for a pair of points that satisfy the
relations (52) through (54), and are deep in space (small
$\delta_x$), the computed value for $\theta$ will be fairly
insensitive to the inaccuracy in $\alpha$. In practice, however,
unless $\delta_x$ is extremely small and $\theta$ is no larger than a
degree or so, the computed value of $\theta$ (when $\alpha$ is
inaccurate) may not be accurate enough.

b) Let us denote the set of six coefficients that results from
pairing points 1, 2 in Figure 2 by $A_{12}, B_{12}, \ldots, F_{12}$,
and the set resulting from points 3, 4 (in the same Figure) by
$A_{34}, B_{34}, \ldots, F_{34}$. In equations (57) through (62), if
$\alpha$ is small and $|x_r|$ is large, then terms involving
$\sin\alpha$ and $\tan(\alpha/2)$ may be ignored and regardless of
the magnitude of $\theta$ the following relations will emerge:

$$A_{12} \approx -A_{34}, \quad B_{12} \approx B_{34}, \quad C_{12} \approx -C_{34} \tag{64}$$

$$D_{12} \approx -D_{34}, \quad E_{12} \approx E_{34}, \quad F_{12} \approx -F_{34} \tag{65}$$

$$B_{12}D_{12} - A_{12}E_{12} \approx -(B_{34}D_{34} - A_{34}E_{34}) \tag{66}$$

$$B_{12}F_{12} - E_{12}C_{12} \approx -(B_{34}F_{34} - E_{34}C_{34}) \tag{67}$$

With the terms $(CD - AF)\sin 2\alpha$ and $E\sin 2\alpha$ in
equation (63) being small, we conclude that

$$d\theta_{12} \approx -d\theta_{34} \tag{68}$$

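Denoting the true value of $\theta$ by $\theta_0$ (here $\theta_0$
is the true value, not the vergence angle), and those computed by
pairs 1,2 and 3,4 by $\theta_{12}$ and $\theta_{34}$, respectively,
we have:

$$\theta_0 \approx \theta_{12} + d\theta_{12} \tag{69}$$

and

$$\theta_0 \approx \theta_{34} + d\theta_{34} \tag{70}$$

Therefore

$$\theta_0 \approx (\theta_{12} + \theta_{34})/2 \tag{71}$$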

c) When $\alpha$ is not small equations (64) and (65) will not be
satisfied. Therefore, we should not simply take the average of the
two computed values of $\theta$. What we should do instead is to take
a weighted average of $\theta_{12}$ and $\theta_{34}$. For example,
if $|d\theta_{12}| < |d\theta_{34}|$, then $\theta_0$ will be closer
to $\theta_{12}$, and therefore $\theta_{12}$ should be assigned a
larger weight. It is noted that although we can usually calculate
$d\theta_{12}$ and $d\theta_{34}$ rather reliably, computational
experiments have shown that, in general, we should not estimate
$\theta_0$ from $\theta_{12} + d\theta_{12}$ or
$\theta_{34} + d\theta_{34}$. As was suggested earlier, a much better
approach is to use $d\theta_{12}$ and $d\theta_{34}$ for assigning
appropriate weights to $\theta_{12}$ and $\theta_{34}$.

Note, however, that the calculation of the (weighted) average of
$\theta_{12}$ and $\theta_{34}$ will make sense only so long as
$d\theta_{12}$ and $d\theta_{34}$ are of opposite signs. Obviously
for small values of $\alpha$, $d\theta_{12}$ and $d\theta_{34}$ do
have opposite signs. We should therefore find out what happens when
the magnitude of $\alpha$ increases. This is done in Appendix C. The
conclusion is that within the range $-20 < \alpha < 20$ degrees
$d\theta_{12}$ and $d\theta_{34}$ have different signs, and
calculation of a weighted average is possible. Since the two pairs of
points (1,2) and (3,4) in Figure 2 cannot be combined for the
computation of $d\alpha$ only when $|\alpha|$ is small (say less than
7 to 8 degrees), the range $-20 < \alpha < 20$ is wide enough.

d) The case when $\theta_0$ is not small can be analyzed in the same
manner as was done in the previous case.
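The weighted averaging described in observation c) can be sketched as below. The text does not specify the weighting rule, so the rule used here is an assumption: each estimate is weighted by the magnitude of the other pair's correction, which gives the larger weight to the estimate with the smaller correction. The function name is hypothetical.

```python
def weighted_pan_estimate(theta_12, theta_34, dtheta_12, dtheta_34):
    """Weighted average of two pan estimates (assumed weighting rule).

    The estimate whose correction is smaller in magnitude gets the larger
    weight. The corrections must have opposite signs, otherwise averaging
    is not meaningful."""
    if dtheta_12 * dtheta_34 > 0.0:
        raise ValueError("corrections have the same sign; averaging is not meaningful")
    weight_12 = abs(dtheta_34)
    weight_34 = abs(dtheta_12)
    total = weight_12 + weight_34
    if total == 0.0:
        # Both corrections vanish: fall back to the simple average.
        return 0.5 * (theta_12 + theta_34)
    return (weight_12 * theta_12 + weight_34 * theta_34) / total


# Hypothetical values in degrees.
estimate = weighted_pan_estimate(1.0, 1.4, 0.1, -0.3)
```

When the two corrections are equal in magnitude the rule reduces to the simple average of equation (71); only the ratio of the magnitudes enters, not their individual values.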




6.  COMPUTATION  OF  RELATIVE  TILT 
ANGLE 

For the case $\alpha = \theta_0 = 0$ equation (8) reduces to

$$\phi = \frac{f\left[y_{1r}x_{1l}(y_{2r} - y_{2l}) - y_{2r}x_{2l}(y_{1r} - y_{1l})\right]}{x_{1l}y_{1r}(f^2 + y_{2r}y_{2l}) - x_{2l}y_{2r}(f^2 + y_{1r}y_{1l})}$$

It can be seen that, in general, so long as vertical disparity can be
calculated reliably so can be the relative tilt angle. In particular,
pairs of points recommended for the computation of $\theta$ seem to
be quite appropriate for the computation of $\phi$ too.
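The two-point expressions for the relative pan and tilt angles (case $\alpha = \theta_0 = 0$) can be evaluated together, since they share a denominator. The sketch below is illustrative only: the function names are hypothetical, and the synthetic data generator assumes a first-order disparity model, $y_r - y_l = (f^2 + y_ry_l)\phi/f - x_ly_r\,d\theta/f$, which is the two-point linear system that these closed forms solve exactly.

```python
def relative_pan_and_tilt(point_1, point_2, f, tiny=1e-9):
    """Return (dtheta, phi) in radians from two points.

    Each point is a tuple (x_l, y_l, y_r) in pixels; f is the focal
    length in pixels. Valid only for alpha = theta0 = 0."""
    x1l, y1l, y1r = point_1
    x2l, y2l, y2r = point_2
    g1 = f * f + y1r * y1l
    g2 = f * f + y2r * y2l
    disparity_1 = y1r - y1l
    disparity_2 = y2r - y2l
    denominator = x1l * y1r * g2 - x2l * y2r * g1
    if abs(denominator) < tiny:
        raise ZeroDivisionError("badly paired points: vanishing denominator")
    dtheta = f * (disparity_2 * g1 - disparity_1 * g2) / denominator
    phi = f * (x1l * y1r * disparity_2 - x2l * y2r * disparity_1) / denominator
    return dtheta, phi


def synthesize_left_y(x_l, y_r, dtheta, phi, f):
    """Left-image y under the assumed first-order disparity model."""
    return (y_r - f * phi + x_l * y_r * dtheta / f) / (1.0 + y_r * phi / f)


# Two points near the top right and bottom right corners (hypothetical values).
f = 549.0
true_dtheta, true_phi = 0.01, 0.004
top = (250.0, synthesize_left_y(250.0, 240.0, true_dtheta, true_phi, f), 240.0)
bottom = (250.0, synthesize_left_y(250.0, -240.0, true_dtheta, true_phi, f), -240.0)
estimated_dtheta, estimated_phi = relative_pan_and_tilt(top, bottom, f)
```

With unquantized synthetic coordinates both angles are recovered; rounding the coordinates to whole pixels before calling the function is a simple way to observe the digitization effects discussed in the text.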

Even for large values of $\alpha$ and $\theta$, the computation of
$\phi$ appears straightforward and stable. In fact we only have to
pick pairs of points so that neither vertical disparities nor the
denominator for $\phi$ (which is the same as the denominator for
$\theta$) is small. For the effect of a large world pan angle on
vertical disparity, $\delta y$, see Section 4.

Note that to calculate the world pan angle, $\alpha$, instead of
using the expression for $\theta$ (see equation (28)), we could have
used the expression for $\phi$. We did not do this because the latter
approach has been found (by means of computational experiments) to be
less reliable. Apparently, the reason is the weaker coupling of
$\phi$ and $\alpha$ (as compared to $\theta$ and $\alpha$).

7.  A  SCHEME  FOR  A  RELIABLE 
CALIBRATION 

While computation of $\phi$ is rather easy, computation of $\theta$
and, in particular, $\alpha$ is difficult. If a sufficient number of
points that are appropriate for computation of these angles is
available, then all of them can be computed reliably. However, in
many images we may not find enough points or pairs of points that can
be sufficiently trusted. Therefore, a program which is devised for
the computation of pan and tilt angles should adopt the following
strategy.

The program should first consider points (and pairs of points) that
appear reliable. If there exist enough such points (usually about six
to eight pairs, according to simulation), then points that appear
less reliable need not be considered. Otherwise, those points and
combinations of points that do not appear as promising should be
considered too. However, the program should have special code to
estimate the reliability of all sensitive mathematical operations. If
an operation appears unreliable, then the point (or the combination
of points) must be rejected. If the reliability of a given operation
is such that neither rejection nor "full" acceptance seems
appropriate, then an appropriate weight, i.e. a number less than one,
should be assigned to the result of that computation. Therefore, at
the end, when we average over all non-rejected results, the
contribution of an individual result (to the final result) will be
proportional to its degree of reliability. If the sum of the weights
of the individual results for a given angle is less than a certain
threshold, then the computation of that angle must be considered
unreliable. Although for $\alpha$ this may happen rather frequently,
we believe that for most images computation of the more important
angles, i.e. $\phi$ and $\theta$, will be reliable.
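The averaging step of this strategy can be sketched as follows. This is a minimal illustration, not the authors' program: the function name, the representation of results as (value, weight) pairs, and the default threshold are all assumptions, and the estimation of the reliability weights themselves is left outside the sketch.

```python
def combine_estimates(results, min_total_weight=2.0):
    """Reliability-weighted average of individual angle estimates.

    results: iterable of (value, weight) pairs, where weight is 1.0 for a
    fully accepted result, between 0 and 1 for a partially trusted one,
    and 0 (or less) for a rejected one.
    Returns None when the total weight is below min_total_weight, i.e.
    when the angle must be considered unreliable."""
    total_weight = 0.0
    weighted_sum = 0.0
    for value, weight in results:
        if weight <= 0.0:
            continue  # rejected result
        weight = min(weight, 1.0)
        weighted_sum += weight * value
        total_weight += weight
    if total_weight < min_total_weight:
        return None
    return weighted_sum / total_weight


# Hypothetical per-pair estimates of one angle (degrees) with their weights.
pair_results = [(1.0, 1.0), (2.0, 0.5), (9.0, 0.0)]
accepted = combine_estimates(pair_results, min_total_weight=1.0)
too_few = combine_estimates(pair_results, min_total_weight=2.0)
```

Returning `None` rather than a number makes the "unreliable" outcome explicit, so that a caller cannot mistake a poorly supported average for a valid angle.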

8.  CONCLUDING  REMARKS 

We have studied difficulties due to digitization error which arise in
computation of the relative pan and tilt angles and the world pan
angle (also referred to as gaze angle). If the gaze angle is zero and
the vergence angle is small, then the best points to be used in the
computation of the relative orientation angles are corner points, and
the best combinations of points are those that pair points from two
corners that have the same latitude. (In particular, points coming
from two opposite corners must not be combined.) If either the world
pan angle or the vergence angle is large, then of the two pairs of
corners that have the same latitude only one pair of corners should
be considered. Indeed, in this case points that are horizontally near
the middle of the image may give rise to more stable computations
than those from the "bad" corners.

Since the computation of the world pan angle is frequently
unreliable, it is necessary to bypass the computation of this angle
and to compute the relative orientation angles. This can often be
done, despite the fact that in the analytic formulation of the
problem the relative orientation angles and the world pan angle are
coupled. If the world pan angle cannot be computed reliably, one
should take the average of results obtained from the two pairs of
corners (unless the horizontal disparities of these points are
considerably different from each other). When the world pan angle or
the vergence angle is large, then before averaging, points from the
inappropriate corner should be replaced by points closer to the
middle of the image. Also, simple averaging should be replaced by a
weighted averaging. It is interesting to note that Kamgar-Parsi and
Eastman [7] have shown that to compute the relative roll angle when
the value of the gaze angle is not known accurately, one should take
the average of results obtained by combining points from two
different sets of three corners.

APPENDIX  A 

Numerical computation of $\theta$, $\phi$ and $\alpha$ requires the
calculation of many intermediate quantities (such as t, S, etc.).
To discuss the computational difficulties, we have to examine
the magnitudes of these quantities separately. In the
main text these quantities are expressed in terms of $x_r$, $x_l$,
$y_r$ and $y_l$ (of a given point in space), which makes it difficult
to estimate their magnitudes. This is because for a point
in space not all $x$ and $y$ coordinates of both left and right
image positions are independent quantities. Therefore in
this Appendix we have eliminated either $y_l$ or $y_r$ or $x_r$ and
have derived expressions for the intermediate quantities in
terms of $\theta$, $\phi$ and the remaining projected quantities.

First, we shall derive expressions for $1/\lambda_l$ and $\lambda_r/\lambda_l$.
Equations (2) and (4) in the main text can be written as

*r(A,/A,)  =  <U  -ctO  +  yi  sin 80<t>  +  bco8a/Xt  (Al) 


Expression for $\gamma$ (in terms of $x_l$, $y_r$ and $y_l$):
Substitution of (A8) into $\gamma$ yields

flliSy  cos  a  -  (pyi  cos  a  +  fy,  X\  sin  a)<f>  +  VrVM  (All) 
(»  +  }4>)  sin  a 


— /(Ar/A|)  =  -ci  -  d{0  +  pi  cos  80<f>  +  6sin  a/Aj  (A2) 


Elimination of $\lambda_l$ except when it occurs in ratio with $\lambda_r$
yields


A,/Aj 


at  +  bi$  -  yt  cos  a<t> 
Or 


Expression for $y_l$:

Equations  (A3)  and  (3)  yield 

VrQj  +  Urt>l8  ~  fa r<t> 

y>  ~  Or  +  yr(cos(0o  +  a)<t> 

Expression  for  yr: 

Equation  (A4)  can  be  written  as 

_  <»r(yi  +  !  4) 
y'  a)  +  b\8  -  yi  cos(fl0  +  <*)4 


(A3) 


(A4) 


(A5) 


Expression for Vertical Disparity when $\theta_0 = 0$:
From equation (A4) we obtain


APPENDIX  B 

We would like to make an estimate of the error we would
incur in the computation of $\theta$ for the case $\alpha = 0$.

The pair of points used for the computation of $\theta$ are
assumed to be near two corners of the left image plane
that have the same latitude. Assuming that the error in
the computation of the vertical disparity is negligible,
virtually all the error will come from the computation of
num. Utilizing the numerator in equation (21), which is
essentially the same as num, it can be verified that for the
pair of points that we have considered we can roughly write

$$\mathrm{num} \approx (f^2 + y_{1r} y_{1l})\,[(y_{2r} - y_{2l}) - (y_{1r} - y_{1l})] \qquad (B1)$$

Thus,  the  sensitivity  of  num  to  error  is  basically  the 
same  as  the  sensitivity  of  the  difference  of  the  two  vertical 
disparities.  From  equation  (A7)  we  find 

$$(y_{2r} - y_{2l}) - (y_{1r} - y_{1l}) = 2 x_{1l} y_{1r} \theta / f \qquad (B2)$$


6*  =  Vr  -  Vl 


yr6x  sin  a  4-  (far  +  V,  cos  a)4  ~  1/rM 
Or  +  y,  cos  (a)4 


(A6) 


Note that the above expression does not involve $y_l$. Likewise
we can eliminate $y_r$ and obtain a similar expression for
vertical disparity. For this, we rearrange the above equation
to obtain


In units of pixels, the right hand side of the above
expression is about $4\theta$ (where $\theta$ is in degrees).

The average error for the difference of vertical disparities
(in units of pixels) is equal to


$$\int_{-1/2}^{1/2}\int_{-1/2}^{1/2}\int_{-1/2}^{1/2}\int_{-1/2}^{1/2} |x_1 + x_2 - x_3 - x_4| \, dx_1 \, dx_2 \, dx_3 \, dx_4$$


8j,  =  yr-  yt 


yi  6Z  sin  a  +  (fa r  +  yf  cos  a)4  —  ytM 
at  +  M  —  yt  cos  a4 


(A7) 


where $\delta_x = x_r - x_l$ is the horizontal disparity.

Expression for $x_r$ (when $\theta_0 = 0$ but $\alpha$ is not zero):
Equation (A3) can be rearranged to obtain


Xr 


Xiyr  sin  a  +  f6t  cos  a  +  yrM  -  (/*  +  VrVi)  cos  a4 
(yi  +  f4)  sin  o 


Expression for $\gamma \sin\alpha - f \delta_y \cos\alpha$:

Substitution of equation (A4) into the expression of
interest yields


7  sin  a  —  cos  a  = 


yra,M  ~  (/<»?  +  y? at  cos  a)4 
Or  +  cos  ay ,4 


(A9) 


Expression  for  num: 

After substituting equation (A4) for $y_l$ into the numerator
of (7) and ignoring terms of the second order in $\theta$ and
$\phi$, we obtain


n«mc  \f{yutu*lr- IOr*ll«lr)+VlrV«r(S«rtu*>l/*»»-Vlrbi®ll/0lr)l®  (A10) 


Using the relation

$$\int_{-1/2}^{1/2} |a + x| \, dx = \begin{cases} |a| & \text{if } |a| \ge 1/2 \\ a^2 + 1/4 & \text{if } |a| < 1/2 \end{cases} \qquad (B3)$$

after considerable algebra, the quadruple integral is found
to be equal to 7/15. Therefore, on the average, the relative
error in the computation of the difference of vertical
disparities is $7/(60\theta)$. From this we conclude that, on the
average, the absolute error in the computation of $\theta$ is
$7/60 \approx 0.12$ degrees.
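
The value 7/15 can be checked numerically. The following Python sketch is an illustration only and is not part of the derivation above. It assumes that each of the four digitization errors is uniformly distributed over one pixel, integrates over one variable in closed form using relation (B3), and applies a midpoint rule to the remaining three. The function names and the grid size are hypothetical choices.

```python
def inner_integral(a):
    """Integral of |a + x| for x in [-1/2, 1/2], as given by relation (B3)."""
    a = abs(a)
    return a if a >= 0.5 else a * a + 0.25


def mean_abs_error(n=40):
    """Midpoint-rule estimate of the integral of |x1 + x2 - x3 - x4|
    over the four-dimensional unit cube centred at the origin."""
    h = 1.0 / n
    nodes = [-0.5 + (i + 0.5) * h for i in range(n)]
    total = 0.0
    for x1 in nodes:
        for x2 in nodes:
            for x3 in nodes:
                # The x4 integral is done analytically; the sign of x4
                # does not matter because its range is symmetric.
                total += inner_integral(x1 + x2 - x3)
    return total * h ** 3


estimate = mean_abs_error()
print(estimate, 7 / 15)
```

With 40 nodes per axis the estimate should agree with 7/15 to within about $10^{-3}$; dividing by the disparity difference of about $4\theta$ pixels then gives the relative error $7/(60\theta)$ quoted above.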


APPENDIX  C 

We want to know for what values of the world pan angle,
$\alpha$, the derivative of the relative pan angle, $\theta$, with respect
to $\alpha$, i.e. $d\theta/d\alpha$, has opposite signs for the two pairs of
points (1,2) and (3,4) in Figure 2. Since we know that for small
values of $\alpha$, $d\theta_{34}/d\alpha < 0$ and $d\theta_{12}/d\alpha > 0$, what we want
to find out is the maximum value of $|\alpha|$ for which we still
have $d\theta_{34}/d\alpha < 0$ and $d\theta_{12}/d\alpha > 0$.

To do this we have to examine the magnitudes of the
terms in the numerator of $d\theta/d\alpha$ (see equation (57)), and
therefore the coefficients A through F. It can be verified,
comparatively speaking, that the magnitudes of D and E
are “large”, of B and F “medium”, and of A and C
(depending on the sign of $x_r$ and the value of $\alpha$) “small” to
“medium”. Therefore, when determining the sign of $d\theta/d\alpha$
the terms that need be considered are $BD$, $AE$ (when A
is not small) and $DC \sin 2\alpha$ and $EC \cos 2\alpha$ (when C is not
small). Due to symmetry we only need to study the case of
positive values of $\alpha$. For $\alpha > 0$, depending on the sign of
$x_r$ we will distinguish two cases:

I) $\alpha > 0$, $x_r > 0$

In this case C is small. Therefore, we only need to
examine the sign of the expression $BD - AE$. Since in this
case we have $A > 0$, $B < 0$, $D > 0$ and $E < 0$, both $BD$ and
$AE$ are negative. For small values of $\alpha$, $|B|$ is appreciably
larger than $A$. Thus $BD - AE > 0$. However as $\alpha$ increases,
B decreases and A increases. Since the magnitudes of D
and E are roughly equal, $BD - AE$ changes its sign when
$|A| \approx |B|$. To find an estimate of the angle, $\alpha_c$, at which
this happens we set $A = -B$ to obtain

tana'  =  Jr7l  (C1) 

For a pair of points such as (3,4) in Figure 2, we have
$\tan\alpha_c \approx 0.5$, and therefore $\alpha_c \approx 27$ degrees. That is, up to
about 27 degrees $BD - AE$ retains the sign it has at $\alpha = 0$.
Numerical computations, however, suggest that the derivative
changes its sign at about 22-24 degrees (depending on
the closeness of the pair of points to the corners). Because,
contrary to the assumption we made when calculating $\alpha_c$,
D and E are not exactly equal in magnitude, and because
other terms can have some small effects, the discrepancy
between 27 and 22-24 degrees is not unexpected.

II) $\alpha > 0$, $x_r < 0$

In this case A is small. Therefore, we should examine
the sign of the expression $BD + CD \sin 2\alpha - CE \cos 2\alpha$,
or (since in this case $D < 0$ and $E < 0$) basically the
expression $-B + C \cos 2\alpha - C \sin 2\alpha$. At 27 degrees the
magnitudes of B and C become equal. Thus, for the term
$C \sin 2\alpha$ to overcome $-B + C \cos 2\alpha$, $\alpha$ must be considerably
larger than 27 degrees. That is, for a pair of points
such as (1,2) in Figure 2, $d\theta_{12}/d\alpha$ remains positive much
longer than $d\theta_{34}/d\alpha$ remains negative. Therefore, $d\theta_{12}/d\alpha$
and $d\theta_{34}/d\alpha$ have different signs up to 27 (or as was
mentioned above 22-24) degrees. Since non-negligible values of
$\theta$, in some cases, may slightly decrease this range, a more
dependable range will roughly be $-20 < \alpha < 20$ degrees.

ACKNOWLEDGEMENT 

The author would like to thank Professor Larry Davis
and Professor Azriel Rosenfeld for their helpful comments
and discussion.


REFERENCES 

[1] F. Solina, “Errors in Stereo Due to Quantization”,
MS-CIS-34, Department of Computer and Information
Science, University of Pennsylvania (1985).

[2] R. Duda and P. Hart, Pattern Classification and
Scene Analysis, John Wiley and Sons (1973).

[3] A. Kak, “Depth Perception for Robots”, Handbook
of Industrial Robotics, Editor: S. Nof, John Wiley and
Sons (1985).

[4] R. Schalkoff, “Automatic Recalibration of Moving
Cameras in Stereo Vision Systems”, Image and Vision
Computing Vol. 3, 118-121 (1985).

[5] B. Kamgar-Parsi, “Practical Computation of Pan
and Tilt Angles in Stereo”, University of Maryland, Center
for Automation Research Technical Report 195, 1986.

[6] B. Kamgar-Parsi and B. Kamgar-Parsi, “Evaluation
of Digitization Error in Computer Vision”, University of
Maryland, Center for Automation Research Technical
Report, in preparation.

[7] B. Kamgar-Parsi and R. D. Eastman, “Calibration
of a Stereo System with Small Relative Angles”, University
of Maryland, Center for Automation Research Technical
Report 240, 1986.




Uncertainty  Analysis  of  Image 
Measurements  * 




M.  A.  Snyder 

Computer  and  Information  Science 
Univ.  of  Mass. 

Amherst,  Ma.  01003 


Abstract 

In this paper we discuss the implications for “low-level”
computer vision of the uncertainty in image measurements.
We illustrate these ideas with applications to 1) the
recovery of depth from both stereo and motion, 2) the
refinement of depth maps in multiple-frame motion
algorithms, 3) the detection of independently moving objects,
and 4) the relative efficacy of stereo and motion in
recovering depth. We conclude with several numerical examples
illustrating this analysis, and some conjectures on
possible future applications of uncertainty analysis in computer
vision.


1  Introduction 

The calculation of environmental (3D) parameters is a
major concern of many techniques in computer vision. For
instance, an algorithm for stereopsis computes the 3D depth
of environmental points, and a motion algorithm computes
depth and/or motion parameters (rotational and
translational velocities). These environmental parameters might
be desired quantities in themselves, or they could be used
for further processing, such as for shape recovery and the
segmentation of the environment into distinct objects, or
for recognition, path-planning, etc. All such techniques
must take as input the position in the image or images
of particular “interesting” image structures such as points,
contours, and regions. In the case of correspondence-based
techniques for stereo or motion, the position of
corresponding points in two or more images serves as the input to
an algorithm, the output of which is some environmental
parameter or parameters. We will be interested here
only in the case where the output is a numerical quantity,
such as depth. We think that similar considerations will
hold for non-numerical output quantities (zero-crossings,
straight lines, regions, etc.), but the application of our ideas
to these is not yet clear.

*This research was supported by Army ETL under grant DACA76-
M-C-OOOt


In most correspondence-based techniques the positions
of corresponding points in two views are determined either
by using an interest operator (such as Moravec's [Mora1980]
or Kitchen and Rosenfeld's [Kitc1982]) to select distinguished
image points which are easiest to match, or by using a
correlation measure. We investigate in this section how
accurately the position of such image points can be
determined. We confine our discussion to interest points selected
by some interest operator, but identical considerations will
hold for techniques which use a correlation measure.

Techniques which select interest points consist in
optimizing some numerical measure. The location of the
interest point is then defined as the center of the pixel in
which the position of the optimum lies. It then follows
that if all that is known about the interest point is that
the optimum lies within a particular pixel, the actual
position of the optimum is uncertain: it could be anywhere
inside the indicated pixel. We shall express this by saying
that the position of the interest point is known only to lie
within an uncertainty region $\mathcal{R}$, which in this case is the
pixel which contains the optimum. This uncertainty will
be called discretization uncertainty, since it arises from the
discretization of the image plane.

In general, image points may not be in the place
predicted by the laws of projection because of sensor distortion
and aliasing, in addition to the above-mentioned
discretization of the image plane, or they may appear with an
intensity different from that predicted by the laws of physics
because of sensor noise and the quantization of grey-levels.
These phenomena give rise to what are usually called
“errors,” since they lead to image structures which do not
correspond to physical structures in the environment. These
“errors” have no meaning in terms of the image, of course,
since their existence can be ascertained only by comparison
of the image with the 3D environment which gave rise to it.
Since this comparison is ordinarily not available for a vision
system, the concept of “error” is not generally useful.

Although “errors” cannot be experimentally defined in
terms of the image alone, the effects of errors can be so
defined. Let us represent the interest operator as a surface,
with the optimum corresponding to a peak in the surface.
We shall call the position in the image of this optimum
the nominal position of the interest point in question.
Errors of the sort noted above will generally lead to surfaces
which have a broad peak, rather than the sharp peak that
would be obtained in their absence. It is intuitive that the
broader the peak, the greater must have been the effect
of “errors” (whatever their source) on the position of the
interest point, and the less certain we should be that the
position of the peak corresponds to the position that would
be found in the absence of any error. That is, the
uncertainty in the position of the peak should be related to the
broadness of the peak; a broad peak corresponds to a more
uncertain position of the peak than does a narrow peak.
If this uncertainty is ignored, the position of the interest
point will be what we have called the “nominal” position
of the interest point. We will therefore take the adjective
“nominal” to mean “in the absence of uncertainty.” The
nominal value of a quantity Q will be denoted by enclosing
it in angle brackets: $\langle Q \rangle$.

We make these observations more quantitative by
defining the uncertainty region (UR) for the interest point as the
largest connected set of image points in the neighborhood
of the position of the optimum of the interest measure for
which the value of the interest measure exceeds a certain
threshold $\theta$. The dependence of this uncertainty region
on the threshold is a reflection of the fact that there is no
canonical definition of the “broadness” of the peak. A more
precise definition could be given by fitting a Gaussian
surface to the interest measure surface, and defining the UR
in terms of the principal curvatures (or standard deviations
along the principal directions) of the surface. Although this
would be a more mathematically satisfying procedure, it
would also clearly be more computationally expensive. We
therefore view the proper definition of the uncertainty
region for each interest point to be an experimental question;
we will not consider it further in the present work. If a
quantity Q is a scalar quantity (such as depth), the UR is
just an interval $I$, which we will write as

$$I = [\,Q^{\mathrm{mean}} - \delta Q,\; Q^{\mathrm{mean}} + \delta Q\,]. \qquad (1)$$

The fact that the value Q is in this interval will also be
expressed by the standard notation

$$Q = Q^{\mathrm{mean}} \pm \delta Q. \qquad (2)$$

Here $Q^{\mathrm{mean}}$ is just the center of the uncertainty interval,
and $\delta Q$ is the uncertainty in Q.
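
The threshold definition of the uncertainty region can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in our experiments: the interest measure is assumed to be available as a small 2D array, connectivity is taken to be 4-connectivity, and the function name, the sample arrays and the threshold value are hypothetical.

```python
from collections import deque


def uncertainty_region(interest, threshold):
    """Return (peak, region): the pixel holding the optimum of the interest
    measure, and the 4-connected set of pixels around it whose interest
    value exceeds `threshold`."""
    rows, cols = len(interest), len(interest[0])
    # Nominal position: the pixel in which the optimum lies.
    peak = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: interest[rc[0]][rc[1]])
    if interest[peak[0]][peak[1]] <= threshold:
        return peak, set()
    region, queue = {peak}, deque([peak])
    while queue:  # breadth-first flood fill outward from the peak
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and interest[nr][nc] > threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return peak, region


# Hypothetical interest surfaces: a sharp peak and a broad peak.
sharp = [[0, 0, 0, 0, 0],
         [0, 1, 2, 1, 0],
         [0, 2, 9, 2, 0],
         [0, 1, 2, 1, 0],
         [0, 0, 0, 0, 0]]
broad = [[0, 0, 0, 0, 0],
         [0, 6, 7, 6, 0],
         [0, 7, 9, 7, 0],
         [0, 6, 7, 6, 0],
         [0, 0, 0, 0, 0]]

for name, surface in (("sharp", sharp), ("broad", broad)):
    peak, region = uncertainty_region(surface, threshold=5)
    print(name, peak, len(region))
```

Both surfaces have the same nominal position, but the broad peak yields a larger uncertainty region, which is the behaviour the definition is meant to capture.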

Since errors are not defined in terms of the image alone,
they are not sources of information about the image.
However, the structure of the interest measure around an
interest point is information about the effect of errors on
the image. Indeed, it is all of the information that can be
obtained about these errors from the image alone.
Consequently, the calculation of uncertainties can be viewed as a
conversion of “non-information” (errors) into information.
This information is not ordinarily used. The most
important point of the present work is that if one truly wishes
to use all the information present in an image, then the
uncertainty in the positions of interest points (or any other
directly measured quantity) must be calculated.

The definition of the uncertainty region for an interest
point F is an essentially mathematical one, but should have
some physical interpretation. We propose that the
uncertainty region should be chosen so that it is “highly likely”
that the interest point in fact lies somewhere inside the
uncertainty region. This means that we make the following
physical interpretation of the uncertainty region: any
image point in the uncertainty region for the interest point F
is a possible position of the interest point consistent with
the information in the image. Thus, although in
a particular experiment (i.e., a particular image) we find a
definite position for the interest point (namely, its nominal
position), any other point in the uncertainty region could
have been found as the position of the interest point
(because of the random nature of sensor noise, phase effects in
aliasing, etc.).

1.1  Combining  Uncertainties 

A vision algorithm will generally directly measure some
quantities, such as the positions and grey-level intensities
of image points, and then use those directly measured
quantities to compute other (derived) quantities, such as depths
or motion parameters. We discuss here how the uncertainty
regions for directly measured quantities are used to
compute the uncertainty regions for derived quantities.

Suppose that $P_1$ and $P_2$ are two directly measured
quantities, and that Q is a quantity which is computed from $P_1$
and $P_2$. That is, there is some functional relation between
the values each quantity takes:

$$f(P_1, P_2, Q) = 0. \qquad (3)$$

Suppose that we have determined the uncertainty regions
$\mathcal{R}_1$ and $\mathcal{R}_2$ for $P_1$ and $P_2$. Pick a point $P_1$ in $\mathcal{R}_1$ and a point
$P_2$ in $\mathcal{R}_2$ and compute, from (3), the resulting value of Q.
If $P_1$ and $P_2$ are then allowed to independently range over
their respective uncertainty regions, Q will range over some
region $\mathcal{R}$. Owing to the physical interpretation we gave to
the uncertainty region, this region is just the uncertainty
region for the derived quantity Q. The calculation of the
uncertainty regions for the quantities computed by an
algorithm will be called an uncertainty analysis (UA) of the
algorithm.

We recall that the nominal value $\langle Q \rangle$ for a quantity Q is
the value that Q would take if uncertainty were neglected.
The value $Q^{\mathrm{mean}}$, on the other hand, is just the center of
the uncertainty interval for Q. For directly measured
quantities, we will (for simplicity only) in general take these two
values to be equal. For derived quantities, however, there
is no a priori reason for the equality of the “mean” and
nominal values. Indeed, in section 3 we will see examples
where these two values differ.
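
The construction of a derived uncertainty interval can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: relation (3) is assumed to be solvable explicitly as Q = g(P1, P2), the directly measured quantities are scalars with interval uncertainty regions, and the range of Q is found by dense sampling (which could miss extrema lying between samples for a general g). The function names, the choice g = P1/P2 and the numerical values are hypothetical.

```python
from itertools import product


def grid(center, delta, samples):
    """Evenly spaced values covering [center - delta, center + delta]."""
    step = 2.0 * delta / (samples - 1)
    return [center - delta + i * step for i in range(samples)]


def propagate(g, measured, samples=101):
    """measured: list of (nominal value, uncertainty) pairs, one per input.
    Returns (nominal Q, center of the Q interval, uncertainty in Q)."""
    axes = [grid(center, delta, samples) for center, delta in measured]
    # Let every input range independently over its uncertainty interval.
    values = [g(*point) for point in product(*axes)]
    low, high = min(values), max(values)
    nominal = g(*[center for center, _ in measured])
    return nominal, 0.5 * (low + high), 0.5 * (high - low)


# Hypothetical derived quantity: a ratio of two measured quantities.
nominal, mean, delta = propagate(lambda p1, p2: p1 / p2,
                                 [(10.0, 0.5), (2.0, 0.5)])
print(nominal, mean, delta)
```

For this ratio the interval for Q runs from 9.5/2.5 to 10.5/1.5, so its center lies above the nominal value 10/2. This is a simple instance of the “mean” and nominal values of a derived quantity differing even though they coincide for the inputs.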




1.2  Uncertainty  regions  and  probability 
measures 

An alternative approach to the analysis of uncertainty can
be given by ascribing a probability to each value that a
quantity Q takes. This point of view has been taken by
several researchers [Faug86, Matt86]. This would be
implemented by giving the probability density $p(Q)$, where
$p(Q)\,dQ$ is the probability that Q takes a value between
$Q$ and $Q + dQ$. One could assume, as in photogrammetry
[Slam80], that (in the absence of evidence to the contrary)
$p(Q)$ is a Gaussian normal distribution centered at $Q^{\mathrm{mean}}$,
of standard deviation $\sigma$. If Q is a scalar quantity this will
take the form

$$p(Q) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(Q - Q^{\mathrm{mean}})^2}{2\sigma^2}\right]. \qquad (4)$$


The distinction between this “probability” approach
and the “uncertainty region” approach is artificial, since
the UR approach can be given a probability interpretation
by letting the probability density be constant inside the
uncertainty region and zero outside the region. For instance,
if we write the uncertainty region as the interval (1), then
the corresponding probability approach is to take

$$p(Q) = \frac{1}{2\,\delta Q} \qquad (5)$$

for $Q \in I$, and 0 otherwise. A rough comparison of the
UR approach with the Gaussian normal approach can be
obtained by taking $\delta Q = \sigma$.

For simplicity, we have ignored all probability
considerations in defining the uncertainty regions for derived
quantities. For example, although the addition of two uncertain
quantities (each having the “flat” distribution (5)) will have
a “triangular” probability distribution (since this is the
convolution of two flat distributions), we will treat it as “flat.”
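
The “triangular” shape mentioned above can be seen in a discrete analogue. The Python sketch below is an illustration only: it assumes each quantity takes one of five equally likely values, and the function names are hypothetical.

```python
def flat(n):
    """Discrete flat distribution over n equally likely values."""
    return [1.0 / n] * n


def convolve(p, q):
    """Distribution of the sum of two independent discrete quantities."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


triangular = convolve(flat(5), flat(5))
print(triangular)
```

The resulting probabilities rise linearly to a maximum at the central value and then fall linearly. Treating the sum as “flat” over the same support therefore overstates the weight of the extreme values, which errs on the side of caution.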

Whether a “probability” approach using, for example,
a Gaussian normal probability density, or an “uncertainty
region” approach is most appropriate depends on the
particular task at hand, the hardware on which the algorithm is
implemented, etc. The probability approach has the virtue
of being the most elegant and mathematically satisfactory,
but it will generally involve more computation than the
uncertainty region approach. We thus suspect that the
probability approach will be of interest mainly in theoretical
work, while the UR approach will be of more practical
interest. We will use only the UR approach in what follows,
but the reader should keep in mind that everything we have
to say about uncertainty can be readily couched in the
language of probability densities.


2 Why uncertainty analysis is important

Uncertainty  analysis  (UA)  is  important,  even  critical,  for 
the  performance  of  some  common  tasks  of  vision  systems: 

• Obstacle Avoidance. A sensor must use an algorithm
which computes the 3D position of objects in
the environment if it is to navigate without colliding
with obstacles. By performing a UA of the algorithm,
one calculates the uncertainty region for each of the
obstacles. This region corresponds, according to our
physical interpretation, to the possible positions of
the object, and is therefore the region which must be
avoided by the sensor. Only a portion of these
regions (that corresponding to the nominal positions of
the obstacles) would be obtained if no UA of the
algorithm is performed.

• When an algorithm should not be used. If an
algorithm computes the value of some quantity Q,
but a UA of the algorithm shows that the relative
uncertainty in Q is large (i.e., close to 100%), then the
algorithm will not give accurate results, and so should
not be used in cases where precision is important. For
instance, highly accurate depths are required if the
shape of environmental objects is to be determined
from a depth map. If the algorithm in question gives
highly uncertain depths, then it should not be used
to compute shape. Again, without a UA of the
algorithm, this could not be known.

• Deciding between algorithms. When two (or more)
vision modules (e.g., stereo and motion) compute the
same environmental parameter (e.g., depth), a UA
of each will give the circumstances under which each
module is expected to give the most accurate results
for the parameter. This can be used to decide which
module should be used in order to find the most
accurate values for the parameter (see section 4).

• Calculating search regions. In the case of
multiple-frame motion algorithms, or any other algorithm
in which the positions of image points are used to
predict the search region for further instances of the
point, UA can be used to constrain the search region,
and hence will give a more computationally efficient
algorithm (by preventing too large a search region),
and will minimize the problem of false matches (by
ensuring that the search region is large enough to
guarantee the presence of the desired match).
Absent an uncertainty analysis, these search regions can
only be guessed.




• Generality. Most vision algorithms work well only
on a few images or sets of images, and poorly on
anything else. One reason for this lack of generality is
that the performance of the algorithm is often highly
dependent on the values assigned to certain weights
or thresholds. In addition, these values are often
assigned by the experimenter so as to maximize this
performance on a particular set of images. Since UA
will give the uncertainty in the parameters of the
algorithm, in certain cases the inverse of the uncertainty
can be used as a weight, or to construct an appropriate
threshold. UA can thus possibly be used to
interactively set these weights or thresholds according to
the uncertainties found by the algorithm, and hence
could possibly lead to more robust general purpose
vision systems. Some efforts along these lines have been
made by Moravec [Mora1980] and by Canny [Cann83]
(see section 3).

• Robustness and Graceless Degradation. With
few exceptions, present vision algorithms are
numerically unstable and fragile. This graceless degradation
is a direct consequence of the sensitivity of the
algorithms to noise and other “errors.” An algorithm
which incorporates an uncertainty analysis, on the
other hand, takes such “errors” as part of the
informational input to the algorithm. It is thus possible
that UA may furnish a general way of implementing
the principle of graceful degradation [Marr82] directly
into a computational theory.
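
As one illustration of the remark under Generality that the inverse of the uncertainty can serve as a weight, the following Python sketch combines several estimates of the same quantity. It is a hypothetical example rather than a scheme taken from the algorithms discussed here: the function name and the numerical values are assumptions, and each uncertainty is assumed to be strictly positive.

```python
def weighted_estimate(estimates):
    """estimates: list of (value, uncertainty) pairs with uncertainty > 0.
    Each value is weighted by the inverse of its uncertainty."""
    weights = [1.0 / uncertainty for _, uncertainty in estimates]
    total = sum(weights)
    return sum(w * value for w, (value, _) in zip(weights, estimates)) / total


# Two hypothetical depth estimates: 10.0 +/- 0.5 and 12.0 +/- 2.0.
depth = weighted_estimate([(10.0, 0.5), (12.0, 2.0)])
print(depth)
```

The combined value lies closer to the less uncertain estimate, and no hand-tuned weight is needed because the weights come from the uncertainty analysis itself.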


3  Uncertainty  Analysis  and  Motion 

Uncertainty analysis has important consequences for
motion algorithms. In this section we address the implications
of UA for two cases of interest in motion research. We first
consider a camera undergoing uniform translational motion
through a rigid environment, i.e., there is only a single rigid
object, namely the environment itself. In the second case,
we consider the implications of UA when the environment
consists of several independently moving rigid objects.


3.1  Uniform  camera  translation  through 
a  rigid  environment 

3.1.1  Basic  Equations 

We assume the usual geometry of perspective projection,
as illustrated in Figure 1, where we have chosen the focal
length of the projection to be unity. The camera
coordinate system is denoted by $(X, Y, Z)$. As mentioned in the
introduction, the camera is moving with constant speed in
the $+Z$ direction through a rigid environment. In this case
the motion of the image point p of some 3D point P is
analytically simple [Long1980]; each point p moves along a
straight trajectory, and all such straight lines intersect
at a single point O. Since all image points move away from
O, this point is called the Focus of Expansion (FOE).

If we let $D(t)$ be the distance of p from the FOE at
time t, and let $Z(t)$ be the Z-coordinate of the 3D point P
to which p corresponds, then it is easy to show [Long1980]
that

$$\frac{D(t') - D(t)}{D(t)} = \frac{Z(t) - Z(t')}{Z(t')}, \qquad (6)$$

where $t$ and $t'$ are any two times. We shall refer to $Z(t)$ as
the depth of P at time t. It follows immediately from (6)
that the quantity $D(t)Z(t)$ is a constant of the motion, i.e.,

$$D(t)Z(t) = D(t')Z(t') \qquad (7)$$

for all $t$ and $t'$.

We assume (without loss of generality) that the
environment is sampled at regular intervals $t_n = n\tau$
($n = 0, 1, \ldots, N$), where $\tau$ is the (constant) interval between
successive frames. Since the motion of the camera is
uniform, we let T be the distance the camera moves toward the
environment in the time interval $\tau$. We take $t' = t_{n+1}$,
$t = t_n$ in (6), and define:

$$D_n = D(t_n); \quad Z_n = Z(t_n); \quad \Delta D_n = D_{n+1} - D_n$$

(see Figure 2). Then since $T = Z_n - Z_{n+1}$ for all n,

$$\frac{\Delta D_n}{D_n} = \frac{D_{n+1} - D_n}{D_n} = \frac{Z_n - Z_{n+1}}{Z_{n+1}} = \frac{T}{Z_{n+1}}. \qquad (8)$$

We can then rewrite (8) as

$$D_{n+1} = K_n D_n, \qquad (9)$$

where

$$K_n \equiv 1 + \frac{T}{Z_{n+1}} = \frac{Z_n}{Z_{n+1}} > K_{n-1} > 1. \qquad (10)$$


If we denote by $p_n$ the position of the projected environmental
point P at time $t_n$, then all the $p_n$ (for $n = 0, \ldots, N$)
lie along a line. Clearly, the displacements of p between
frames are constrained to lie along this line, so we will call
this line the displacement path (see Figure 2).

If we denote by $\mathbf{r}_j$ the vector from the FOE to $p_j$,
$j = 0, \ldots, N$, then since all the $p_j$ lie along the displacement
path, equation (9) implies that

$$\mathbf{r}_{n+1} = K_n \mathbf{r}_n. \tag{11}$$

Note that $|\mathbf{r}_j| = D_j$. Equation (11) will prove to be very
useful in the following.
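
As a quick numerical check of (7), (9) and (10), the following sketch projects a single 3D point with unit focal length while the camera advances a distance T per frame. The function names and the sample coordinates are illustrative assumptions, not part of the original analysis.

```python
import math


def image_distance_from_foe(X, Y, Z):
    """Distance D of the projected point from the FOE.

    Assumes unit focal length and pure translation along +Z, so the FOE
    sits at the image origin and the projection is (X/Z, Y/Z).
    """
    return math.hypot(X / Z, Y / Z)


def track(X, Y, Z0, T, n_frames):
    """Return [(Z_n, D_n)] for a camera that advances T per frame."""
    samples = []
    for n in range(n_frames):
        Zn = Z0 - n * T  # depth shrinks by T each frame, T = Z_n - Z_{n+1}
        samples.append((Zn, image_distance_from_foe(X, Y, Zn)))
    return samples


# Hypothetical point and camera step, chosen only for illustration.
frames = track(X=3.0, Y=1.0, Z0=50.0, T=2.0, n_frames=6)

# Eq. (7): D_n * Z_n is the same in every frame.
invariants = [Z * D for Z, D in frames]

# Eqs. (9)-(10): D_{n+1} / D_n = K_n = Z_n / Z_{n+1}, growing with n.
ratios = [frames[n + 1][1] / frames[n][1] for n in range(len(frames) - 1)]

for n, K in enumerate(ratios):
    print(f"K_{n} = {K:.5f}   D*Z = {invariants[n]:.6f}")
```

The printed ratios increase from frame to frame, which is the statement $K_n > K_{n-1} > 1$: the image point accelerates away from the FOE as the camera approaches.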


3.1.2  Definitions 

Let F be an interest point in the image. We denote by
$F_n$ the instance of F in frame n at time $t_n$. The position
of $F_n$ will be the point $p_n = (x_n, y_n)$. The nominal value
that any quantity Q takes will be denoted by enclosing Q in
angle brackets: $\langle Q \rangle$. Thus, the nominal position of $F_n$ will
be denoted by $\langle p_n \rangle = (\langle x_n \rangle, \langle y_n \rangle)$. Recall that the nominal
value $\langle Q \rangle$ is the value that Q would have in the absence of
uncertainty. We will refer to the line that passes through
the nominal positions of the FOE, $F_0, F_1, \ldots, F_N$ as the
nominal displacement path (NDP).

As noted in the introduction, the position of $F_n$ is uncertain
but can be assumed to be in an uncertainty region
$\mathcal{R}_n$ which contains the nominal point $\langle p_n \rangle$. In order to
investigate the situation analytically, we must make some
assumption about $\mathcal{R}_n$. We will choose $\mathcal{R}_n$ to be a rectangle
$R_n$ centered at $\langle p_n \rangle$, with sides parallel to the x- and
y-axes of length $2\Delta x_n$ and $2\Delta y_n$, respectively. We will
denote these sides as $\{2\Delta x_n, 2\Delta y_n\}$. Thus,

$$(x_n, y_n) = p_n \in R_n \iff \begin{cases} |x_n - \langle x_n \rangle| \le \Delta x_n, \\ |y_n - \langle y_n \rangle| \le \Delta y_n. \end{cases} \tag{12}$$

We assume, similarly, that the location of the FOE is
uncertain, but lies within a rectangle R of sides $\{2\Delta x, 2\Delta y\}$
centered at the nominal position $\langle \mathrm{FOE} \rangle$ of the FOE. We
choose the coordinate system such that $\langle \mathrm{FOE} \rangle$ is at the
origin (0,0).

There are essentially two cases of interest in this analysis.
In the first case we assume that the depth of the
environmental point is unknown, and calculate the uncertainty
in this depth which follows from the uncertainty in
the positions of the interest point in the two frames (the
“startup” process). In the second case we assume that the
depth (and its attendant uncertainty) have been found in
one of the two frames, and use this and the uncertainties
in $p_0$ and $p_1$ to find the search region for the interest point
in subsequent frames (the search portion of the “updating”
process). We will only summarize the results; the details
of all calculations in the remainder of this section can be
found in [Snyd86a].


3.1.3  Calculating the uncertainty in depth which
follows from uncertainly known FOE, F0, and F1

We assume that the FOE and the positions $p_0$ and $p_1$ of the
interest point F are known only to lie inside their respective
uncertainty rectangles (see Figure 3). We will use the
positions and sizes of these rectangles to calculate both the
“best estimate” $Z_1^{\mathrm{mean}}$ for the environmental depth $Z_1$, and
the uncertainty $\delta Z_1$ in this value. We have from (8) that

$$Z_1 = \frac{T\,D_0}{\Delta D_0}, \tag{13}$$

where $D_0$ and $\Delta D_0$ are assumed to lie in the intervals

$$\langle D_0 \rangle - \delta D_0 \le D_0 \le \langle D_0 \rangle + \delta D_0,$$
$$\langle \Delta D_0 \rangle - \delta(\Delta D_0) \le \Delta D_0 \le \langle \Delta D_0 \rangle + \delta(\Delta D_0), \tag{14}$$

and T is known accurately ($\delta T = 0$, $T = \langle T \rangle$). We have
equated the mean and nominal values because the quantities
involved are directly measured quantities. The latter
assumption is made only for simplicity of exposition; the
effect of uncertain T is easily included, but complicates the
appearance of the formulas. We then use (13) to find the
allowed range of values for $Z_1$:

$$Z_1^{\mathrm{mean}} - \delta Z_1 \le Z_1 \le Z_1^{\mathrm{mean}} + \delta Z_1.$$

To simplify the analysis, we first define the relative uncertainty
$\alpha$ in $D_0$, and the relative uncertainty $\beta$ in $\Delta D_0$:

$$\alpha = \frac{\delta D_0}{\langle D_0 \rangle}; \qquad \beta = \frac{\delta(\Delta D_0)}{\langle \Delta D_0 \rangle}. \tag{15}$$

We also define the nominal value $\langle Z_1 \rangle$ of $Z_1$ to be

$$\langle Z_1 \rangle = \frac{\langle T \rangle \langle D_0 \rangle}{\langle \Delta D_0 \rangle}. \tag{16}$$

That is, $\langle Z_1 \rangle$ is just the value one would expect from (13)
in the absence of uncertainty.

We then have that

$$Z_1^{\mathrm{mean}} \pm \delta Z_1 = \langle T \rangle \cdot \frac{\langle D_0 \rangle \pm \delta D_0}{\langle \Delta D_0 \rangle \mp \delta(\Delta D_0)} = \langle Z_1 \rangle \left[\frac{1 \pm \alpha}{1 \mp \beta}\right], \tag{17}$$

from which it follows that

$$Z_1^{\mathrm{mean}} = \langle Z_1 \rangle\,\frac{1 + \alpha\beta}{1 - \beta^2}, \tag{18}$$
$$\delta Z_1 = \langle Z_1 \rangle\,\frac{\alpha + \beta}{1 - \beta^2}. \tag{19}$$

Note that the “best” estimate $Z_1^{\mathrm{mean}}$ for $Z_1$ is not equal
to the value $\langle Z_1 \rangle$ which would be obtained by substituting
the best values for the other quantities into (13). Indeed,
$Z_1^{\mathrm{mean}} \ge \langle Z_1 \rangle$, equality holding only when $\beta = 0$, i.e., there
is no uncertainty in $\Delta D_0$. The relative uncertainty in $Z_1$
is, using (18) and (19), given by

$$\frac{\delta Z_1}{Z_1^{\mathrm{mean}}} = \frac{\alpha + \beta}{1 + \alpha\beta}. \tag{20}$$

At the end of this section we use the result (20) to give
quantitative meaning to two well known results in motion
analysis.

In practice, the relative uncertainty in inter-frame interest
point displacements is often much larger than that in
the distance of interest points from the FOE. If we therefore
neglect $\alpha$ in (18) and (19),

$$Z_1^{\mathrm{mean}} \simeq \langle Z_1 \rangle, \tag{21}$$
$$\delta Z_1 \simeq \beta\,Z_1^{\mathrm{mean}}, \tag{22}$$

and hence the relative uncertainty in $Z_1$ is just equal to the
relative uncertainty $\beta$ in $\Delta D_0$.
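
The interval arithmetic behind (13)-(20) can be sketched in a few lines. The function name, argument names and the sample numbers below are assumptions made for illustration; only the formulas come from the text. The sketch evaluates the extreme values of (13) directly and leaves the comparison with the closed forms (18)-(20) to the caller.

```python
def depth_interval(T, D0, dD0, delta_D0, delta_dD0):
    """Nominal depth, best estimate and half-width of Z_1.

    D0, dD0              nominal values <D_0> and <Delta D_0>
    delta_D0, delta_dD0  half-widths of the intervals in eq. (14)
    T                    inter-frame camera translation, assumed exact
    """
    if delta_dD0 >= dD0:
        # The interval for Delta D_0 would reach zero, so eq. (13)
        # would not give a finite upper bound on the depth.
        raise ValueError("delta_dD0 must be smaller than dD0")
    z_nominal = T * D0 / dD0                           # eq. (16)
    z_high = T * (D0 + delta_D0) / (dD0 - delta_dD0)   # upper sign in (17)
    z_low = T * (D0 - delta_D0) / (dD0 + delta_dD0)    # lower sign in (17)
    z_mean = 0.5 * (z_high + z_low)                    # midpoint of the range
    half_width = 0.5 * (z_high - z_low)                # delta Z_1
    return z_nominal, z_mean, half_width


# Hypothetical measurements in units of the focal length.
T, D0, dD0 = 2.0, 0.10, 0.004
delta_D0, delta_dD0 = 0.002, 0.001

z_nominal, z_mean, half_width = depth_interval(T, D0, dD0, delta_D0, delta_dD0)
alpha, beta = delta_D0 / D0, delta_dD0 / dD0

print(f"<Z_1> = {z_nominal:.2f}, Z_1^mean = {z_mean:.2f}, delta Z_1 = {half_width:.2f}")
print(f"relative uncertainty = {half_width / z_mean:.4f}")
print(f"eq. (20)             = {(alpha + beta) / (1 + alpha * beta):.4f}")
```

With these numbers the midpoint of the allowed range lies above the nominal depth, as the remark following (19) predicts for any nonzero $\beta$.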

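The expressions for $\alpha$ and $\beta$ in terms of image quantities
depend on how the NDP passes through the uncertainty
rectangles. In the case (called VVV in [Snyd86a]) that it
passes through the vertical sides of each uncertainty rectangle
(see Figure 3), we find:

$$\alpha = \frac{\Delta x + \Delta x_0}{\langle x_0 \rangle},$$
$$\beta = \frac{\Delta x_0 + \Delta x_1}{\langle x_1 \rangle - \langle x_0 \rangle} = \left[\frac{1}{\langle K_0 \rangle - 1}\right] \frac{\Delta x_0 + \Delta x_1}{\langle x_0 \rangle} = \left[\frac{1}{\langle K_0 \rangle - 1}\right] \left\{\frac{\Delta x_0 + \Delta x_1}{\Delta x + \Delta x_0}\right\} \alpha. \tag{23}$$

Algebraically, the case VVV corresponds to the condition

$$\frac{\Delta y}{\Delta x},\ \frac{\Delta y_0}{\Delta x_0},\ \frac{\Delta y_1}{\Delta x_1} \ge \frac{\langle y_0 \rangle}{\langle x_0 \rangle}. \tag{24}$$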

There  are  two  intuitive  and  commonly  accepted  situa¬ 
tions  in  motion  research  where  depth  estimates  are  unreli¬ 
able: 

•  When  the  interest  point  is  “close”  to  the  FOE 

•  When  the  environmental  correlate  to  the  interest  point 
is  “far  away”  from  the  camera. 

To  our  knowledge  a  serious  quantitative  analysis  of  these 
statements  has  never  been  provided.  We  can  use  the  results 
of  this  section  to  do  just  that.  In  terms  of  the  case  VVV 
given  here,  and  using  the  equations  (20)  and  (23)  we  find 

•  The relative uncertainty in the depth is large when
the interest point is “close” to the FOE, where “close”
means that either $\Delta x + \Delta x_0$ or $\Delta x_0 + \Delta x_1$ is comparable
to $\langle x_0 \rangle$. A sufficient, but not necessary, condition
for this is that $\Delta x_0$ is comparable to $\langle x_0 \rangle$, i.e.,
the uncertainty in the position of the interest point is
comparable to the distance of the interest point from
the FOE.

•  The relative uncertainty in the depth is large when
the environmental correlate to the interest point is
“far away” from the camera, where “far away” means
that $\langle Z_0 \rangle \gg \langle T \rangle$, i.e., the camera has moved between
frames only a small fraction of the distance to the
point.

The  reader  should  keep  in  mind,  however,  that  equation 
(20)  is  the  mathematical  expression  of  the  results  above. 
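
The two situations above can be illustrated numerically with the VVV expressions (23) and the relative uncertainty (20). This is a minimal sketch under stated assumptions: unit focal length, the same positional half-width `dx` for the FOE and for both instances of the interest point, and hypothetical scene coordinates. The helper names are not from the original.

```python
def vvv_alpha_beta(x0, x1, dx_foe, dx0, dx1):
    """alpha and beta of eq. (23) for the VVV case."""
    alpha = (dx_foe + dx0) / x0
    beta = (dx0 + dx1) / (x1 - x0)
    return alpha, beta


def relative_depth_uncertainty(alpha, beta):
    """Eq. (20); requires beta < 1 so that the depth interval is finite."""
    if beta >= 1.0:
        raise ValueError("beta >= 1: displacement is not resolved")
    return (alpha + beta) / (1.0 + alpha * beta)


def scenario(X, Z0, T, dx):
    """Relative depth uncertainty for a point at lateral offset X, depth Z0."""
    x0 = X / Z0          # nominal image position in frame 0
    x1 = X / (Z0 - T)    # nominal image position in frame 1
    alpha, beta = vvv_alpha_beta(x0, x1, dx, dx, dx)
    return relative_depth_uncertainty(alpha, beta)


# Hypothetical scenes; lengths in meters, image units are focal lengths.
peripheral = scenario(X=10.0, Z0=50.0, T=2.0, dx=1e-4)
near_foe = scenario(X=0.5, Z0=50.0, T=2.0, dx=1e-4)    # image point close to FOE
far_away = scenario(X=10.0, Z0=200.0, T=2.0, dx=1e-4)  # Z0 much larger than T

print(f"peripheral: {peripheral:.3f}")
print(f"near FOE:   {near_foe:.3f}")
print(f"far away:   {far_away:.3f}")
```

Both the near-FOE point and the distant point come out far less certain than the peripheral, nearby one, even though the positional uncertainty is the same in all three cases.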

3.1.4  Calculating  the  search  region  for  an  interest 
point  in  subsequent  frames 

One of the most important uses of uncertainty analysis is
to use the uncertainty in the position of interest points in
two frames of a dynamic image sequence and in the position
of the FOE to constrain the search region for the instance
of the interest point in subsequent frames. The first
attempt at this was made by Bharwani, Riseman, and Hanson
[Bhar85]. However, their analysis was essentially
one-dimensional, since they confined their search to the nominal
displacement path. Although in some cases this may
be sufficient, the uncertainty region for interest points is in
general two-dimensional, and so a two-dimensional analysis
is of interest. This analysis was made in [Snyd86a], to
which the interested reader is referred for details. The general
idea is to use the uncertainty regions for the FOE, $F_0$,
and $F_1$ to compute the uncertainty interval for the depth
$Z_1$. One then neglects the uncertainty region for $F_0$ (its
presence will be reflected in the uncertainty interval for $Z_1$)
and uses the uncertainty interval for $Z_1$ and the uncertainty
regions for the FOE and $F_1$ to compute the search region
for $F_2$. One finds that this search region is the intersection
of a wedge, lying below one line and above another,
with a rectangle (see Figure 4).

3.1.5  Uncertainty  Analysis  and  multiple-frame  mo¬ 
tion  algorithms 

There  are  several  implications  of  UA  for  multiple-frame 
motion  algorithms. 

•  UA  can  be  used  to  find,  as  in  the  previous  section, 
the  search  region  for  the  instance  of  an  interest  point 
in  the  second  frame  of  a  sequence,  given  the  UR  for 
the  FOE  and  the  interest  point  in  the  first  frame, 
and  the  uncertainty  in  the  depth  estimate  for  the  en¬ 
vironmental  point  (obtained,  for  example,  from  laser 
ranging  with  an  ERIM).  Without  such  an  uncertainty 
analysis,  the  search  region  can  only  be  guessed— an 
unsatisfactory  situation  for  any  motion  algorithm. 

•  If  the  UR  for  the  FOE  and  the  interest  point  in  any 
two  frames  of  a  dynamic  image  sequence  are  known, 
then  UA  can  be  used  to  find  both  the  UR  for  the 
depth  estimate  and  the  search  region  for  the  interest 
point  in  any  subsequent  frame.  A  simple  consequence 
of  this  is  that  if  an  interest  point  has  a  large  UR,  the 
estimate  for  its  depth  will  be  highly  uncertain,  and 
the  search  region  for  further  instances  of  the  point  will 
be  large.  If  one  is  not  interested  in  points  with  large 
uncertainty  (for  instance,  if  one  is  computing  shape 
from  the  depth  map),  then  it  makes  little  sense  to 
follow  this  interest  point  through  the  sequence,  since 
it  will  give  highly  uncertain  depths  (and  hence  be  of 
little  utility),  and  searching  for  it  will  be  very  com¬ 
putationally  expensive.  By  pruning  such  points  from 
the  set  of  tracked  points,  the  computational  efficiency 
of  an  algorithm  will  be  increased.  Without  an  uncer¬ 
tainty  analysis,  this  pruning  cannot  be  done  in  any 
sensible  way. 

•  UA furnishes a method of refining depth maps, in the
following sense. The UR for $F_i$ and $F_j$ will predict
a search region for $F_k$ ($i < j < k$). This search
region represents the possible positions of $F_k$ consistent
with the properties of the previous images. Upon
performing this search and finding $F_k$, one can then
calculate the actual uncertainty region for $F_k$ (recall
that this is defined in terms of the interest operator).
Suppose that this observed uncertainty region is significantly
smaller than the calculated search region.
We interpret this as evidence that the uncertainty regions
for $F_i$ and $F_j$ were too large. These can be
made smaller by increasing the threshold which defined
the uncertainty region until the search region
for $F_k$ and the observed UR for $F_k$ are more nearly
equal. This shrinkage of the UR for the previous
frames thus yields a less uncertain value for the depth
of the point, and hence will give a smaller search region
for the interest point in subsequent frames. The
important thing to note about this is that the setting
of the threshold by feedback from subsequent
frames has the possibility of being made automatic,
in the sense that it will be done based on properties
of the images alone; no human interaction would be
required. This may possibly lead to more general-purpose
and robust algorithms.

•  A  measurement  of  the  depth  in  any  particular  frame 
of  the  sequence  can  be  considered  as  a  measurement 
of  the  depth  in  any  other  (previous  or  subsequent) 
frame,  since  the  motion  of  the  camera  is  known.  Con¬ 
sequently,  finding  the  uncertainty  in  the  depth  in  one 
frame  will  in  addition  give  a  value  for  the  uncertainty 
in  the  depth  in  any  other  frame.  Additionally,  one  can 
consider  any  pair  of  frames  as  giving  a  measurement 
of  the  depth  in  one  of  the  pair.  By  taking  this  pair 
very  far  apart  (temporally),  the  relative  uncertainty 
in  the  depth  can  be  decreased  (cf.  equation  (20)). 
The overall effect is that in a sequence of N frames,
there will be $O(N^2)$ measurements of the uncertainty
in the depth in any particular frame. These uncertainties
can then be statistically combined to yield an
uncertainty in that depth smaller (by a factor of order
N) than any of the particular measurements alone.
By thus incorporating UA into a multiple-frame motion
algorithm, one will necessarily acquire less uncertain
depth measurements (and smaller search regions),
and hopefully more accurate ones.

Some of the above suggestions have been incorporated in
a refinement of the algorithm of Bharwani, Riseman, and
Hanson [Bhar86]. We are presently investigating further
improvements in this algorithm.
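
The text does not specify how depth measurements from different frame pairs are to be combined statistically. As a simpler stand-in, and only as an illustration of the idea that known camera motion lets every measurement be referred to one frame, the sketch below shifts each depth interval to frame 0 using $Z_0 = Z_n + nT$ and intersects the results. Interval intersection is a substitute technique chosen here for brevity; it is valid only if every interval truly bounds the depth. All names and numbers are hypothetical.

```python
def to_frame_zero(interval, n, T):
    """Refer a depth interval measured at frame n to frame 0.

    Uses Z_0 = Z_n + n*T, which holds because the camera advances a
    known distance T toward the scene between successive frames.
    """
    low, high = interval
    return (low + n * T, high + n * T)


def intersect(intervals):
    """Intersection of closed intervals; raises if they are inconsistent."""
    low = max(lo for lo, _ in intervals)
    high = min(hi for _, hi in intervals)
    if low > high:
        raise ValueError("intervals are mutually inconsistent")
    return (low, high)


T = 2.0
# (depth interval at frame n, n): hypothetical outputs of three frame pairs.
measurements = [((46.0, 52.0), 1), ((45.0, 49.0), 2), ((43.5, 46.5), 3)]

in_frame_zero = [to_frame_zero(interval, n, T) for interval, n in measurements]
combined = intersect(in_frame_zero)

print("referred to frame 0:", in_frame_zero)
print("combined interval:  ", combined)
```

The combined interval can never be wider than the narrowest contributing interval, which is the qualitative behavior the last bullet above describes.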

3.1.6  Minimum  time  intervals  for  the  detection  of 
motion 

If the camera moves through a rigid environment, then any
motion algorithm can detect the relative motion between the
camera and an environmental point located at
$(X, Y, Z) = (d, 0, Z_0)$ at time $t_0$ only after the image of the
point has moved at least one-half pixel in the image. If the camera
speed is v, then a simple calculation [Snyd86b] shows that
one must wait until $t = t_0 + \Delta t_{\min}$, where

$$\Delta t_{\min} = \frac{Z_0}{v} \cdot \frac{1}{1 + 2S\,d/Z_0}, \tag{25}$$

before one will see motion of the point. This therefore corresponds
to the smallest time interval one can take between
the frames of a dynamic image sequence if one wishes to detect
the motion of this point. The quantity S is defined as

$$S = \frac{N}{2\tan(\gamma/2)}, \tag{26}$$

where N is the resolution of the image and $\gamma$ is the field of
view of the camera. Note that S is just the focal length of
the camera, measured in pixels. It appears in (25) because
the units of focal length must be converted to pixels when
measuring image quantities.

As a numerical example, suppose that the camera is
attached to a car moving along a straight two-lane road
at v = 15 km/hr, and that a stationary obstacle (such as
another car) is located in the center of the opposing lane a
distance d = 3 m from the translational axis of the camera,
and a distance Z = 100 m away from the camera at t = 0.
Let N = 256 and $\gamma$ = 45°. Then we find that at time t the
obstacle is a distance Z away from the camera, and that
we must wait at least until time $t + \Delta t_{\min}$ in order to see
the image of the obstacle move one-half pixel from where
it was at time t. This is given in Table 1 for selected values
of t.

The interpretation of this table is that there is no point
in taking images of the environment more closely than $\Delta t_{\min}$
apart if one is interested in detecting the relative motion
between the camera and this stationary object.
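
The entries of Table 1 can be regenerated from (25) and (26). In the sketch below the function names are assumptions, and the speed is taken as 15 km/hr, which is the value implied by the Z column of Table 1 (the obstacle closes from 100.0 m to 79.2 m in 5 s).

```python
import math


def focal_length_pixels(N, fov_deg):
    """S of eq. (26): focal length in pixels for an N-pixel image."""
    return N / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def min_detection_interval(Z0, d, v, S):
    """Delta t_min of eq. (25): time for the image point to move half a pixel."""
    return (Z0 / v) / (1.0 + 2.0 * S * d / Z0)


S = focal_length_pixels(N=256, fov_deg=45.0)
v = 15.0 * 1000.0 / 3600.0   # 15 km/hr in m/s
d = 3.0                      # lateral offset of the obstacle, meters

rows = []
for t in (0, 5, 10, 15, 20):
    Z = 100.0 - v * t        # distance to the obstacle at time t
    rows.append((t, Z, min_detection_interval(Z, d, v, S)))

print("t (sec)   Z (m)   dt_min (sec)")
for t, Z, dt in rows:
    print(f"{t:7d} {Z:7.1f} {dt:10.2f}")
```

The half-pixel criterion can be checked directly: for any row, $S d / (Z - v\,\Delta t_{\min}) - S d / Z$ equals 0.5.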


3.2  The detection of independently moving objects

When  the  assumption  of  a  rigid  environment  is  relaxed,  and 
independently  moving  objects  in  the  environment  are  al¬ 
lowed,  the  analysis  of  the  previous  sections  does  not  strictly 
hold.  Nevertheless,  there  are  some  general  comments  that 
one  can  make  about  the  implications  of  uncertainty  analysis 
for  the  detection  of  independently  moving  objects. 

The segmentation of the environment into independently
moving rigid objects can be done in either of two ways:
points in the image are aggregated into regions of some
sort either on the basis of similarity in their 3D motion
parameters [Ullm79, Roac80], or on the basis of similarity in
image parameters [Adiv85]. In either case, one does not expect
that different points (2D or 3D) that in fact should
belong to the same segment will have precisely the
same values for the appropriate parameters. Consequently,
determining whether points or regions with “similar”
parameters should be merged into a single region must
involve setting a threshold which expresses how close the
similarity must be in order for the merging to take place.
Generally, this threshold is set “by hand.” But this is
exactly the kind of thing that UA is supposed to deal with.
We outline here a possible way of using UA in this regard.
We believe that what we have to say here has implications
not only for the case of motion or stereo, but in fact for any
scheme in which segmentation is performed on the basis of
similarity in numerical parameters.

t (sec)    Z (meters)    Δt_min (sec)
   0         100.0           1.23
   5          79.2           0.78
  10          58.3           0.43
  15          37.5           0.18
  20          16.6           0.04

Table 1: Minimum time intervals for detection of motion.

Suppose  that  the  decision  whether  two  points  are  part 
of  the  same  “segment”  is  answered  by  performing  a  Hough 
transform  using  a  particular  relation  between  image  points 
and  some  parametrization.  If  two  image  points  are  then 
found  in  the  same  “bucket”  in  the  transform  space,  then 
(modulo  the  uncertainty  due  to  the  quantization  of  the 
transform  space)  they  can  consistently  be  lumped  together 
as  arising  from  an  image  structure  having  that  particular 
value  for  the  parameters.  The  existence  of  a  bucket  with 
many  such  points  in  it  is  then  evidence  that  there  is  a  signif¬ 
icant  image  structure  possessing  parameters  having  values 
corresponding  to  that  bucket. 

In  practice,  the  peaks  in  the  transform  space  have  some 
width  (at  the  very  least,  the  size  of  the  tessellation),  and 
so  the  values  of  the  parameters  corresponding  to  the  peak 
must  have  some  uncertainty  as  well,  which  will  be  related 
to the broadness of the peak. The net effect of this is that
significant image structures will have parameters which are
not precise, but rather must lie in some region in parameter
space. This region will be the uncertainty region (in
parameter space) for this particular image structure.

The decision whether two image structures having uncertainty
regions (in parameter space) $R_1$ and $R_2$, respectively,
should be merged into a single image structure can
then be easily made. The two structures should be merged
if their respective uncertainty regions in the parameter space
overlap “significantly,” and should not be merged if there
is no such overlap. This is because of the physical interpretation
we gave to the uncertainty region. “Significant” can
be given a quantitative meaning by defining a confidence
measure which expresses how much the two URs overlap.

For instance, we could define

$$\mathrm{conf}(1,2) = \frac{\mathrm{vol}[R_1 \cap R_2]}{\min\{\mathrm{vol}[R_1],\ \mathrm{vol}[R_2]\}}, \tag{27}$$


where vol[R] is the volume of R in parameter space. The
confidence level would then be one if one region contains
the other, and zero if the two regions did not intersect. This
confidence level could then be thresholded to determine
when the two image structures should be merged. This
could perhaps aid in the efficient segmentation of images,
much as the confidence measures of Anandan [Anan86] are
used to find optical flow. These ideas are currently under
investigation.
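
A minimal sketch of the confidence measure (27) follows, under the assumption (not made in the text) that each uncertainty region is an axis-aligned box in parameter space, given as a list of (low, high) pairs, one per parameter. The function names and the merge threshold are hypothetical.

```python
from math import prod


def box_volume(box):
    """Volume of an axis-aligned box given as [(low, high), ...]."""
    return prod(max(0.0, high - low) for low, high in box)


def box_intersection(box_a, box_b):
    """Per-axis overlap of two boxes; an empty axis yields high < low."""
    return [(max(lo_a, lo_b), min(hi_a, hi_b))
            for (lo_a, hi_a), (lo_b, hi_b) in zip(box_a, box_b)]


def conf(box_a, box_b):
    """Eq. (27): overlap volume divided by the smaller of the two volumes."""
    smaller = min(box_volume(box_a), box_volume(box_b))
    if smaller == 0.0:
        raise ValueError("degenerate uncertainty region")
    return box_volume(box_intersection(box_a, box_b)) / smaller


R1 = [(0.0, 2.0), (0.0, 2.0)]
R2 = [(1.0, 3.0), (0.0, 2.0)]   # overlaps half of R1
R3 = [(0.5, 1.0), (0.5, 1.0)]   # contained in R1
R4 = [(5.0, 6.0), (5.0, 6.0)]   # disjoint from R1

MERGE_THRESHOLD = 0.4           # hypothetical value
for name, region in (("R2", R2), ("R3", R3), ("R4", R4)):
    c = conf(R1, region)
    print(f"conf(R1, {name}) = {c:.2f}  merge: {c >= MERGE_THRESHOLD}")
```

Normalizing by the smaller volume is what makes the measure equal to one whenever one region contains the other, regardless of how different their sizes are.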

We  also  conjecture  that  there  is  a  significant  connec¬ 
tion  between  the  uncertainty  analysis  suggested  here  and 
the  work  of  Adiv  [Adiv85]  on  inherent  ambiguities  in  the 
detection  of  independently  moving  objects.  This  is  also 
currently  under  investigation. 


4  Uncertainty  analysis  and  stereo 
vs.  motion  for  depth  recovery 


4.1  Uncertainty  analysis  for  stereo 


Uncertainty analysis can be applied to correspondence-based
stereopsis algorithms in a manner similar to that for
motion algorithms. We align the stereo baseline along the
x-axis of the image coordinate system, and assume, for
simplicity, that the coordinate axes of the two sensors are
perfectly aligned (that is, the uncertainty in these quantities
is zero). Since the images of an environmental point in
the two views must lie along the epipolar line, which is parallel
to the x-axis, only the uncertainty in the x-coordinate
of these image points will be relevant to the computation
of depth. We will let the uncertainty in these positions be
$\Delta x_L$ and $\Delta x_R$ in the left and right views, respectively, with
similar notation for the nominal positions $\langle x_L \rangle$ and $\langle x_R \rangle$ of
the interest point. A trivial analysis then shows that the
relative uncertainty $\epsilon_{st}$ in the depth Z of the corresponding
environmental point is given by

$$\epsilon_{st} \equiv \left[\frac{\delta Z}{Z^{\mathrm{mean}}}\right]_{st} = \frac{\Delta x_L + \Delta x_R}{\langle x_L \rangle - \langle x_R \rangle}. \tag{28}$$

We see from this that for fixed positional uncertainties
in the interest point, the relative uncertainty in depth is
smaller the larger the nominal disparity
$\langle \Delta \rangle = \langle x_L \rangle - \langle x_R \rangle$.

More sophisticated treatments of uncertainties in stereo
have been made by Gennery [Genn80], Kamgar-Parsi [Kamg86],
and Matthies and Shafer [Matt86], to which the reader is
referred. I will use only the simplified analysis above in
what follows.


4.2  Stereo  vs.  Motion  for  depth  recovery 

In section 3, I used uncertainty analysis to find the relative
uncertainty $\epsilon_{mot}$ (equation (20)) in the depth of an environmental
point for the case of a camera undergoing uniform
translation toward a rigid environment. Using this and the
result above, we can compare the efficacy of stereo and of
motion for depth recovery in this context. We only outline
the results; details can be found in [Snyd86c].

We will consider only the situation where the relative
uncertainty in the distance of the interest point from the
FOE is negligible compared to that in the inter-frame displacement
of the point. This means that we neglect $\alpha$ in
equation (20), and so, in the case VVV considered in section 3,
we have

$$\epsilon_{mot} \equiv \left[\frac{\delta Z}{Z^{\mathrm{mean}}}\right]_{mot} \simeq \frac{\Delta x_0 + \Delta x_1}{\langle x_1 \rangle - \langle x_0 \rangle}. \tag{29}$$

We define the motion efficacy parameter $\sigma$ to be the ratio
of the relative uncertainty in depth for stereo to that for
motion:

$$\sigma \equiv \frac{\epsilon_{st}}{\epsilon_{mot}}. \tag{30}$$

Thus, for $\sigma > 1$, motion will give less uncertain depth values
than stereo, while for $\sigma < 1$ exactly the reverse obtains. We
obtain from equations (28, 29) that

$$\sigma = \frac{\Delta x_L + \Delta x_R}{\Delta x_0 + \Delta x_1} \cdot \frac{\langle x_1 \rangle - \langle x_0 \rangle}{\langle x_L \rangle - \langle x_R \rangle}. \tag{31}$$

In many cases, the first factor will be of order unity, since
the same kind of quantity (an interest operator, or a correlation
measure) is being used to find the uncertainties, so
that approximately

$$\sigma \simeq \frac{\langle x_1 \rangle - \langle x_0 \rangle}{\langle x_L \rangle - \langle x_R \rangle} = \frac{\text{displacement of image point between frames}}{\text{stereo disparity of image point between views}}. \tag{32}$$

Thus,  motion  gives  less  (more)  uncertain  depth  measure¬ 
ments  than  stereo  when  the  displacement  of  the  image  point 
between  frames  of  the  motion  sequence  is  larger  (smaller) 
than  the  disparity  between  the  image  point  in  the  two  stereo 
views. 
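
The comparison can be written directly in terms of image measurements. The sketch below assumes hypothetical pixel positions and half-widths; the function names are illustrative, and the motion formula is the approximation in which the uncertainty in the distance from the FOE is neglected.

```python
def relative_uncertainty_stereo(x_left, x_right, dx_left, dx_right):
    """Relative depth uncertainty for stereo: summed half-widths over disparity."""
    return (dx_left + dx_right) / (x_left - x_right)


def relative_uncertainty_motion(x0, x1, dx0, dx1):
    """Relative depth uncertainty for motion: summed half-widths over displacement."""
    return (dx0 + dx1) / (x1 - x0)


def motion_efficacy(eps_stereo, eps_motion):
    """Ratio of stereo to motion relative uncertainty; > 1 favors motion."""
    return eps_stereo / eps_motion


# Hypothetical measurements, all in pixels.
eps_st = relative_uncertainty_stereo(x_left=120.0, x_right=110.0,
                                     dx_left=0.5, dx_right=0.5)
eps_mot = relative_uncertainty_motion(x0=100.0, x1=104.0, dx0=0.5, dx1=0.5)
sigma = motion_efficacy(eps_st, eps_mot)

print(f"eps_st = {eps_st:.3f}, eps_mot = {eps_mot:.3f}, sigma = {sigma:.2f}")
print("less uncertain:", "motion" if sigma > 1 else "stereo")
```

Because the positional half-widths are equal here, the ratio reduces to displacement over disparity (4 pixels over 10 pixels), so stereo is the less uncertain of the two in this made-up case.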

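We now investigate the physical circumstances under
which these situations occur. We consider a physical situation
in which the environmental correlate P of some interesting
point p in the image is located at

$$X = \langle X_0 \rangle, \qquad Y = 0, \qquad Z = \langle Z_0 \rangle. \tag{33}$$

In the case of stereo, the nominal positions of p in the two
stereo views are given by

$$\langle x_L \rangle = S\,\frac{\langle X_0 \rangle}{\langle Z_0 \rangle}, \qquad \langle x_R \rangle = S\,\frac{\langle X_0 \rangle - d}{\langle Z_0 \rangle}, \tag{34}$$

where d is the stereo baseline. Consequently,

$$\langle x_L \rangle - \langle x_R \rangle = S\,\frac{d}{\langle Z_0 \rangle}, \tag{35}$$

where S is given by (26).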

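We have for motion that

$$\frac{\langle x_1 \rangle}{\langle x_0 \rangle} = \frac{\langle Z_0 \rangle}{\langle Z_0 \rangle - T}. \tag{36}$$

It then follows that

$$\langle x_1 \rangle - \langle x_0 \rangle = \left(\frac{T}{\langle Z_0 \rangle - T}\right) \langle x_0 \rangle. \tag{37}$$

Hence, using equations (35) and (37), we find that

$$\frac{\langle x_1 \rangle - \langle x_0 \rangle}{\langle x_L \rangle - \langle x_R \rangle} = \frac{T}{\langle Z_0 \rangle - T}\,\langle x_0 \rangle \cdot \frac{1}{S}\,\frac{\langle Z_0 \rangle}{d} = \frac{T}{d}\,\frac{\langle Z_0 \rangle}{\langle Z_0 \rangle - T}\,\frac{\langle x_0 \rangle}{S}. \tag{38}$$

Equation (36) can then be used to write this in the form

$$\frac{\langle x_1 \rangle - \langle x_0 \rangle}{\langle x_L \rangle - \langle x_R \rangle} = \frac{T}{d}\,\frac{\langle x_1 \rangle}{S}. \tag{39}$$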

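We conclude from (31) that

$$\sigma = \sigma_{pos}\,\sigma_{cam}\,\sigma_{peri}, \tag{40}$$

where

$$\sigma_{pos} = \frac{\Delta x_L + \Delta x_R}{\Delta x_0 + \Delta x_1} \tag{41}$$

is the contribution to $\sigma$ from the uncertainty in the image
points,

$$\sigma_{cam} = \frac{T}{d} \tag{42}$$

is the contribution to $\sigma$ from the camera parameters, and

$$\sigma_{peri} = \frac{\langle x_1 \rangle}{S}. \tag{43}$$

We use the subscript “peri” to refer to the latter contribution
because $\sigma_{peri}$ depends on $\langle x_1 \rangle$; the latter is a measure
of how far away the image point is from the FOE, i.e., how
“peripheral” the motion of the object is.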

The quantity $\sigma_{peri}$ is bounded from above. This follows
from the obvious fact that if the image point p is to be
seen in the second frame of the motion sequence, it must
be within the boundary of the image. That is (recalling
that we have chosen the direction of motion of the camera
to be along the Z-axis, i.e., the FOE is at the origin of the
image coordinate system),

$$\langle x_1 \rangle \le \frac{N}{2}.$$

Recalling the definition (26) of the quantity S, we find that

$$\sigma_{peri} \le \frac{N/2}{S} = \frac{N/2}{N/(2\tan(\gamma/2))}.$$

Hence,

$$\sigma_{peri} \le \tan\frac{\gamma}{2}. \tag{44}$$

We therefore find that

$$\frac{\epsilon_{st}}{\epsilon_{mot}} = \sigma \le \frac{T}{d} \left\{\frac{\Delta x_L + \Delta x_R}{\Delta x_0 + \Delta x_1}\right\} \tan\frac{\gamma}{2}. \tag{45}$$

If the quantity on the right-hand side of this inequality is
less than unity, then stereo will always give less uncertain
depth measurements than motion. In cases where the uncertainty
in the position of image points is comparable in
stereo and motion, the factor in curly brackets is approximately
unity. In this case, then, stereo will always give
less uncertain depth measurements than motion whenever

$$\frac{T}{d}\tan\frac{\gamma}{2} \lesssim 1. \tag{46}$$

If we define the angle $\theta$ to be the angle which the stereo
baseline d subtends at a distance of one inter-frame camera
translation T, i.e.,

$$\tan\theta = \frac{d}{T}, \tag{47}$$

then the condition (46) is that the angle $\theta$ be greater than
or approximately equal to one-half the field of view:

$$\theta \gtrsim \frac{\gamma}{2}. \tag{48}$$

If this condition is satisfied then stereo will give less uncertain
depth measurements than motion. If this condition is
not satisfied, then the question of which will give less uncertain
depth measurements is open, and must be addressed
by consulting the expression for $\sigma$ given in (40).
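
The bound (45) and the equivalent angle test (46)-(48) are easy to evaluate for a given camera configuration. The following sketch uses hypothetical function names and treats the factor in curly brackets as a parameter `sigma_pos` defaulting to one; the strict inequality stands in for the approximate one in the text.

```python
import math


def sigma_upper_bound(T, d, fov_deg, sigma_pos=1.0):
    """Right-hand side of (45): the largest value sigma can take in the image."""
    return (T / d) * sigma_pos * math.tan(math.radians(fov_deg) / 2.0)


def stereo_always_better(T, d, fov_deg, sigma_pos=1.0):
    """True when the bound is below one, so sigma < 1 everywhere in the image."""
    return sigma_upper_bound(T, d, fov_deg, sigma_pos) < 1.0


def baseline_angle_deg(T, d):
    """theta of eq. (47), in degrees."""
    return math.degrees(math.atan(d / T))


T, fov = 2.78, 45.0   # meters per frame, degrees
for d in (0.5, 1.0, 1.5):
    bound = sigma_upper_bound(T, d, fov)
    theta = baseline_angle_deg(T, d)
    print(f"d = {d:.1f} m: bound = {bound:.2f}, theta = {theta:.1f} deg, "
          f"stereo always better: {stereo_always_better(T, d, fov)}")
```

For this translation and field of view, baselines of 0.5 m and 1.0 m leave the question open (the bound exceeds one), while a 1.5 m baseline would make stereo the less uncertain choice everywhere in the image.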

We summarize the results of this section:

•  In general, motion will give less uncertain depth measurements
than stereo when $\sigma > 1$, and stereo will
give less uncertain depth measurements than motion
when $\sigma < 1$, where $\sigma$ is given by (40):

$$\sigma = \sigma_{pos}\,\frac{T}{d}\,\frac{\langle x_1 \rangle}{S} = \sigma_{pos}\,\frac{T}{d}\,\frac{\langle x_1 \rangle}{N/2}\,\tan\frac{\gamma}{2}.$$

•  When $\sigma_{pos} \simeq 1$, motion will be less (more) uncertain
than stereo when the displacement of a point between
two frames is larger (smaller) than the disparity of the
point between the two views.

•  When $\sigma_{pos} \simeq 1$, stereo will be less uncertain than
motion whenever

$$\theta \gtrsim \frac{\gamma}{2}, \qquad \text{where } \tan\theta = \frac{d}{T}.$$

•  If motion is less uncertain than stereo for any point in
the image, it is even less uncertain for more peripheral
points (assuming comparable uncertainties in the
positions of the corresponding image points).


4.2.1  An example

We illustrate these results with a numerical example. Suppose
that we have a stationary object located at $(\delta, 0, Z)$ at
time t = 0. We will assume that the uncertainties in the
positions of interest points are comparable in stereo and
motion, so that $\sigma_{pos} \simeq 1$. The motion efficacy parameter is
then given by

$$\sigma = \frac{T}{d}\,\frac{\langle x_1 \rangle}{S}. \tag{49}$$

But

$$\langle x_1 \rangle = S\,\frac{\delta}{Z - T}, \tag{50}$$

and hence, using the definition of S,

$$\sigma = \frac{T}{d} \cdot \frac{\delta}{Z - T}. \tag{51}$$

We then have that $\sigma = 1$ when $Z = Z_{\max}$, where

$$Z_{\max} = T\left(1 + \frac{\delta}{d}\right). \tag{52}$$

Consequently, $\sigma > (<)\ 1$ when $Z < (>)\ Z_{\max}$. Therefore,
stereo will give less uncertain results than motion when
$Z > Z_{\max}$, and motion will give less uncertain results than
stereo when $Z < Z_{\max}$.

We must, of course, require that, in the case of motion,
the object be visible in the second frame. This means that
$Z > Z_{\min}$, where

$$Z_{\min} = T + \frac{\delta}{\tan(\gamma/2)}. \tag{53}$$

In summary, then, stereo is more accurate than motion
when

$$Z > Z_{\max}, \tag{54}$$

and motion is more accurate than stereo when

$$Z_{\min} < Z < Z_{\max}. \tag{55}$$

In Tables 2 and 3, we give the values for $Z_{\min}$ and $Z_{\max}$
(in meters) for stationary objects $\delta$ = 2, 5 and 10 (meters)
off the camera's translational axis, and for stereo baselines
of d = 0.5 meters (Table 2) and d = 1.0 meters (Table 3).
We assume that the camera is moving at 10 km/hr
(= 2.78 m/sec), taking one image per second, so that T =
2.78 m. We take the field of view of the camera to be
$\gamma$ = 45°.

It  is  clear  from  this  example  that  an  uncertainty  analysis 
such  as  the  one  performed  in  this  section  can  be  used  to 
decide  (depending  on  the  physical  circumstances)  which  one 
of  two  (or  more)  modules  should  be  used  to  most  accurately 
recover  a  particular  environmental  parameter. 


δ (meters)    Z^min (meters)    Z^max (meters)
    2              7.6              13.9
    5             14.9              30.6
   10             26.9              58.4

Table 2: Z^min and Z^max for stereo baseline d = 0.5 meters.


δ (meters)    Z^min (meters)    Z^max (meters)
    2              7.6               8.3
    5             14.9              16.7
   10             26.9              30.6

Table 3: Z^min and Z^max for stereo baseline d = 1.0 meters.

5  Further applications of uncertainty analysis

We  have  indicated  in  this  work  many  possible  applications 
of  uncertainty  analysis  to  computer  vision.  We  would  like  to 
mention  only  one  other  possibility.  In  many  approaches  to 
computing  the  shape  of  environmental  objects,  the  relative 
position  of  at  least  two  (and  usually  more)  environmen¬ 
tal  points  is  used  to  compute  the  shape  of  environmental 
structures.  If,  however,  the  uncertainty  regions  of  these 
two  points  overlap  “significantly”  (a  quantitative  definition 
of  this  could  be  made  along  the  same  lines  as  that  of  the 
confidence  level  (27)  in  section  3),  then  one  can  argue  that 
the  two  points  are  not,  in  fact,  distinguishable,  and  hence 
the  shape  of  the  environmental  structure  cannot  be  found. 

We  believe  this  observation  has  important  implications  for 
many  “Shape  from  X”  algorithms.  This  idea  is  currently 
under  study. 

6  Acknowledgements 

I would like to thank Prof. E. Riseman for his encouragement
and advice at numerous stages in this work, and Dr.
P. Anandan for many helpful discussions. I also thank Dr.
I. Pavlin for asking an important question.



Figure 3: The NDP passes through the vertical sides of all
the uncertainty rectangles (Case VW).


Figure 4: The search region. The shaded region is the
intersection of the search rectangle with the "wedge"
lying between the two bounding lines.


Results of Motion Estimation With
More Than Two Frames*

Hormoz Shariat
and
Keith E. Price

Institute  for  Robotics  and  Intelligent  Systems 
Electrical  Engineering  -  Systems 
University  of  Southern  California 
Los  Angeles,  CA  90089-0273 


Abstract 

This  paper  reports  on  the  results  of  a  general 
motion  estimation  system  based  on  using  three  or  more 
frames.  The  entire  sequence  is  treated  as  a  whole,  thus 
simplifying  the  process  of  estimating  the  motion 
parameters,  and  enabling  the  computation  of  the  motion 
parameters  in  terms  of  the  natural  center  of  motion.  We 
present  a  brief  description  of  the  method,  and  results  for 
real  and  synthetic  images  for  the  case  of  1  point  in  5  (and 
6)  frames. 

1.  Introduction 

Motion  analysis  has  been  a  problem  in  computer 
vision  for  a  long  time,  with  many  different  approaches 
and  tasks.  The  problem  of  estimating  motion  parameters 
from  a  sequence  is  one  of  the  many  tasks  in  the  motion 
analysis  domain.  In  this  paper,  we  discuss  a  new  method 
for  the  estimation  of  three-dimensional  motion 
parameters  given  a  sequence  of  views  of  a  moving  scene. 
The application is general motion estimation, but the
immediate domain is an autonomous land vehicle, where
the vehicle and objects in the scene can both move, but
where the vehicle (camera) motion is generally known.
We  have  adopted  some  concepts  from  other  researchers, 
but  the  primary  difference  is  to  consider  the  entire 
sequence  at  the  same  time  rather  than  to  consider  the 
sequence  as  a  set  of  pairs  of  images.  This  produces  many 
important  constraints  that  do  not  exist  when  only  two 
frames  are  used.  These  constraints  are  used  to  derive 
equations  which  can  be  used  to  compute  the  motion 
parameters  when  a  few  points  are  matched  in  a  few 
frames  of  the  sequence. 

We first summarize past work in motion estimation
through a tabular listing. This is followed by a brief
description of our method, with the details given in [1].
Results of applying this method to synthetic and real
motion sequences are then given.


*This  research  was  supported  in  part  by  DARPA  contracts 
DACA76-85-C-0008  and  F33815-84-K-1404,  order  No  3119  and 
monitored  by  the  U.S.  Army  Engineer  Topographic  Laboratories  and 
the  Air  Force  Wright  Aeronautical  Laboratories  respectively. 


Table  1-1  gives  an  overview  of  many  of  the 
different  methods  for  motion  estimation.  These  are 
classified  according  to  the  number  of  points  required  in 
how  many  frames  (included  are  methods  that  use  lines  or 
optical  flow  as  input);  the  type  of  equations  generated 
(linear,  non-linear,  polynomial);  the  kinds  of  results 
presented;  a  short  comment  on  the  major  strengths  and 
limitations;  and  references.  This  table  is  meant  to  give 
an  overview  of  various  current  methods,  and  not  a 
complete  evaluation  of  all  the  methods. 

2.  Overview  of  the  Multiframe  Method 

In  this  section,  we  introduce  our  method  and  its 
implementation.  The  actual  task  of  motion  estimation  is 
done  by  the  four  modules:  "General  Motion  Estimator," 
"Pure  Rotation  Estimator,"  "Pure  Translation 
Estimator,"  and  "Initial  Guess  Generator."  The  General 
Motion  Estimator  and  the  Initial  Guess  Generator  are 
discussed  in  [1].  The  pure  rotation  and  translation  cases 
are  treated  in  [16]. 

1.  Feature Correspondence Extractor - For the details
of the methods used for the results reported here
see [17] and [18].

2.  Motion  Classifier  -  The  program  picks  the  category 
that  best  describes  the  behavior  of  the  point  on  the 
image  plane. 

3.  Pure  Rotation  Estimator  -  This  module  assumes 
that  the  input  data  represent  a  pure  rotation,  and 
estimates  the  rotation  parameters  which  best  fit  the 
input  data.  This  is  discussed  more  fully  in  [16]. 

4.  Pure  Translation  Estimator  -  This  module  assumes 
that  the  input  data  represent  a  pure  translation, 
and  estimates  a  translation  vector  which  best  fits 
the  input  data.  This  output  can  also  be  used  as  a 
quick  collision  detector.  This  is  discussed  more 
fully  in  [16]. 

5.  Translation Compensator - This module adjusts the
input data to compensate for the computed
translation.

6.  Rotation Compensator - This module adjusts the
input data to compensate for the computed
rotation.


No. Points        Equation Type         Results       Strength              Limitation             Source
in Frames
---------------------------------------------------------------------------------------------------------
OF+Depth          Nonlinear             Synthetic     Solve equations       Need depth             [2]
                                                      using Hough
1 in 25+          Nonlinear             Synthetic     Large noise           Complexity,            [3]
                  Kalman Filter                                             many matches
100+ frames       Linear                Real data     No matches            Much data,             [4]
                                                                            complex for general
2 in 4            Nonlinear             Real data     Jointed rigid         Parallel projection    [5]
                                                      objects
4 in 3            2nd order             none          Simple equations      No structure           [6]
                  polynomial
2 in 3            2nd order             none          Simple equations      Parallel projection,   [7]
                  polynomial                                                needs rotation
1 in 3            2nd order             none          Simple equations      Rotation only          [8]
                  polynomial
3 in 3            Mostly 2nd order      none          Simple equations,     Small motions          [9]
or 2 in 4         polynomials           none          no initial guess
or 1 in 5                               real data
4 lines in 3      2nd order             Synthetic     Simple equations      Need initial           [10]
                  polynomial            lines                               guesses
6 in 3            Nonlinear             Synthetic     Uses lines            Sensitive to errors    [11]
2 in 3            Mostly linear         Synthetic     Simple                Only parallel          [12]
                                                                            projection
5 in 2            2nd order             Real and      Rigidity              Only 2 frames          [13]
                                        synthetic     constraint
6 in 2            Linear                Synthetic     Rigidity,             Restricted             [14]
                                                      theoretical           surfaces
                                                      analysis
8 in 2            Linear                Synthetic     Unique solution       Restricted             [15]
                                                                            point location

Table 1-1: Motion Estimation Using Feature Point Matches
Survey of past results


7.  Initial  Guess  Generator  -  This  module 
automatically  generates  initial  guesses  based  on  an 
analysis  of  the  formulation. 

8.  General  Motion  Estimator  -  This  module  uses  the 
Gauss-Newton  method  along  with  the  guesses 
provided  by  the  "Initial  Guess  Generator"  to 
iteratively  solve  the  nonlinear  general  motion 
equations. 

9.  Error  and  Confidence  Evaluator  -  For  each 
computed  parameter  we  calculate  an  estimate  of 
the  error  in  the  parameter,  and  a  measure  of  how 
confident  the  program  is  in  its  estimated  error 
measure. 
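The Gauss-Newton iteration named in item 8 can be illustrated on a much smaller problem than the general motion equations, which are given in [1] and not reproduced here. The sketch below is a hypothetical stand-in: it assumes a point rotating in the image plane about a known center at the origin, observed in 5 frames, and refines an initial guess for the radius and the rotation angle per frame. All names are illustrative.

```python
import math

def gauss_newton_rotation(obs, radius0, theta0, max_iters=20, tol=1e-24):
    """Fit (radius, theta) so that frame k lies at
    radius * (cos(k*theta), sin(k*theta)); obs is a list of (x, y)."""
    radius, theta = radius0, theta0
    for _ in range(max_iters):
        # Accumulate the normal equations (J^T J) dp = -J^T r.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for k, (x, y) in enumerate(obs):
            c, s = math.cos(k * theta), math.sin(k * theta)
            rx, ry = radius * c - x, radius * s - y   # residuals
            jr = (c, s)                               # d(residual)/d(radius)
            jt = (-k * radius * s, k * radius * c)    # d(residual)/d(theta)
            a11 += jr[0] * jr[0] + jr[1] * jr[1]
            a12 += jr[0] * jt[0] + jr[1] * jt[1]
            a22 += jt[0] * jt[0] + jt[1] * jt[1]
            g1 += jr[0] * rx + jr[1] * ry
            g2 += jt[0] * rx + jt[1] * ry
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-15:
            raise ArithmeticError("normal matrix is nearly singular")
        # Solve the 2x2 system by Cramer's rule.
        d_radius = (-g1 * a22 + g2 * a12) / det
        d_theta = (-g2 * a11 + g1 * a12) / det
        radius, theta = radius + d_radius, theta + d_theta
        if d_radius ** 2 + d_theta ** 2 < tol:
            break
    return radius, theta

# Noise-free synthetic observations: 20 degrees of rotation per frame.
true_radius, true_theta = 1.0, math.radians(20.0)
frames = [(true_radius * math.cos(k * true_theta),
           true_radius * math.sin(k * true_theta)) for k in range(5)]
radius, theta = gauss_newton_rotation(frames, radius0=0.8, theta0=0.3)
print(radius, math.degrees(theta))
```

As in the system described above, the quality of the starting guess matters: the iteration is only guaranteed to behave well near the solution, which is the role the Initial Guess Generator plays.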

3.  Results 

We  have  applied  the  method  to  a  number  of 
computer  generated  test  sequences  and  a  few  real  image 
sequences.  The  test  data  covered  the  complete  range  of 
possible  motions  from  pure  translation  to  pure  rotation, 
and included both small and large motions. Using
stochastic techniques, a thorough error, sensitivity, and
performance analysis of the method was also performed.

On the average, the program converged to the right
answer for most noiseless data within 10 iterations. For
noisy data, the answers were reasonable and their
accuracy was directly a function of the amount of
"Relative Noise" in the input data. "Relative Noise" is
defined as:

$$\text{Relative Noise} = \frac{\text{The amount of noise in the disparity vector}}{\text{The length of the disparity vector}}$$
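A minimal sketch of this ratio follows. It assumes that "the amount of noise" is the Euclidean length of the error in a 2-D disparity vector; the text does not spell this out, so treat that reading, and the function name, as assumptions.

```python
import math

def relative_noise(noise_vec, disparity_vec):
    """Length of the noise vector divided by the length of the disparity vector."""
    return math.hypot(*noise_vec) / math.hypot(*disparity_vec)

# A disparity of (3, 4) pixels perturbed by (0.3, 0.4) pixels.
print(relative_noise((0.3, 0.4), (3.0, 4.0)))
```

Read the other way round, a relative noise of 14.8% on an average disparity of 4.94 pixels (the values listed for Case #3 in Table 3-1) corresponds to roughly 0.148 × 4.94 ≈ 0.73 pixels of noise.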

The  average  time  per  iteration  for  the  case  of  1 
feature  in  5  frames  was  0.556  seconds  and  for  the  case  of 
1  feature  in  6  frames  was  0.843  seconds  on  the  Symbolics 
3640  (with  the  garbage  collector  on).  Note  that  these 
times  are  short,  especially  when  compared  to  the  time 
required  for  feature  extraction  and  matching  (or 
calculating  optical  flow). 

The  coordinate  system  is  located  at  the  focal  point 
of  the  camera.  It  is  oriented  such  that  the  plane  defined 
by  X-  and  Y-axes  is  parallel  to  the  image  plane,  and  the 
Z-axis  is  pointing  away  from  the  camera. 

For synthetic test data, all measurements and
calculations are given in terms of multiples (or fractions)
of the focal length, which is assumed to be 1. For the
real data, the focal length is 50 millimeters and all
measurements and calculations are in millimeters. Note
that for the real cases, the dimensions of the images
(given in x and y directions) are 36 × 24 mm for the
"Turning Car" sequence and 36 × 36 mm for the
"Crossing Car" sequence. Also, notice that in all cases,
the calculated motions are scaled to the camera's focal
plane. (This was arbitrarily chosen since the absolute
depth information, i.e. the scaling, is lost during the
imaging process.)

3.1.  Synthetic  Test  Data 

It  is  intended  that  the  results  given  in  this  section 
serve  only  as  examples  of  the  behavior  of  the  program 
under various types of motion. One must study the
statistical results reported in [1] and [16] before making
any general conclusions about the performance of the
program.  Synthetic  test  data  was  generated  to  test  all 
portions  of  the  motion  program.  The  three  test  cases 
reported  here  are  for  general  motion  only. 

1.  Case  #1  -  An  "easy"  case,  where  both  the 
translation  and  rotation  are  large. 

2.  Case  #2  -  A  "typical"  case. 

3.  Case  #3  -  A  "worst"  case,  where  both  the 
translation  and  rotation  are  very  small.  This  case 
was  chosen  to  study  the  source  and  effect  of  various 
sources  of  errors. 

Table 3-1 lists the parameters for each of these
cases. Note that in this table (and other tables in this
section), the direction of the axis of rotation, which has
unit length, is identified with the angles γ and χ, which
are defined in Figure 3-1. The relative noise in the table
is caused by the (simulated) digitization process on the
image plane (assuming a focal length of 1, a 60° field of
view and a 512 × 512 image plane), and represents the
average noise of the disparity values. In this table, D_avg
is the average magnitude of disparity vectors given in
pixel units.


Figure 3-1: Definition of Angles: γ and χ


For  each  of  these  3  cases,  the  corresponding  motion 
parameters  were  used  to  generate  an  imaginary  moving 
point  in  space.  Then,  the  image  of  this  moving  point  on 
the  image  plane  was  sampled  to  get  our  test  data.  This 
could  give  either  "ideal"  (noise  free)  test  data  or 
simulated  digital  data. 
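The sampling step can be sketched as follows, using the stated imaging assumptions (focal length 1, 60° field of view, 512 × 512 image plane). The motion used here, a constant displacement per frame, is a hypothetical simplification chosen only to keep the example short; the test cases in the paper combine translation with rotation.

```python
import math

FOCAL = 1.0      # focal length (all lengths in focal-length units)
FOV_DEG = 60.0   # full field of view
N_PIXELS = 512   # pixels along each image axis

# Width of one pixel on the image plane.
PIXEL = 2.0 * FOCAL * math.tan(math.radians(FOV_DEG) / 2.0) / N_PIXELS

def project(point):
    """Perspective projection of a 3-D point (X, Y, Z), Z > 0."""
    X, Y, Z = point
    return (FOCAL * X / Z, FOCAL * Y / Z)

def digitize(image_point):
    """Snap image coordinates to the nearest pixel position."""
    return tuple(round(c / PIXEL) * PIXEL for c in image_point)

def sample_track(start, step, n_frames):
    """Hypothetical motion: the point moves by a constant step per frame."""
    return [tuple(s + k * d for s, d in zip(start, step))
            for k in range(n_frames)]

track = sample_track(start=(1.0, 2.0, 10.0), step=(0.1, 0.0, -0.5), n_frames=5)
ideal = [project(p) for p in track]       # "ideal" (noise free) test data
digital = [digitize(q) for q in ideal]    # simulated digital data

for q, d in zip(ideal, digital):
    print(q, d)
```

With these settings one pixel is about 0.00226 focal lengths wide, so digitization moves each coordinate by at most half that amount.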

Table  3-2  has  the  results  for  both  the  5  and  6  frame 
cases,  including  the  calculation  errors  and  number  of 
iterations  needed  for  convergence.  The  table  also  gives 
the  range  of  error  and  performance  measures  generated 
by  the  program  for  each  test  case. 

Figure 3-2 shows the convergence of the best 3
guesses to the solution. Figures 3-3 through 3-5 compare
the calculated 3-D motions with the input data for test
cases #1, #2, and #3 (for both the 5 and 6 frame cases).
The tags (arrows) denote the location of the input data
(hollow circles) at the time of each frame. The curve
represents the calculated 3-D motion after it is
interpolated and projected back onto the image plane.

As  Table  3-2  shows,  even  though  cases  #1  and  #2 
have  reasonable  errors,  case  #3  has  large  errors.  This  is 
due  to  the  fact  that  the  motion  (and  especially  the 
rotation)  is  very  small,  which  causes  the  following 
problems: 

1.  T  is  so  small  that  its  information  is  lost  in  the 
noise. 

2.  Since  the  axis  of  rotation  is  calculated  by 
computing  the  cross  product  of  two  vectors,  it  is 
sensitive  to  the  noise  in  those  two  vectors.  Thus,  if 
the  relative  noise  in  each  of  those  two  vectors  is 
high,  then  the  noise  in  the  axis  of  rotation  is  even 
higher. 

3.  Since  the  disparity  vectors  on  the  image  plane  are 
small  (their  average  is  4.9  pixels),  the  relative  image 
digitization  error  becomes  high  (for  this  case  it  was 
14.8%  of  the  average  disparity). 

4.  Because the disparities are small, the matrices used
for the Gauss-Newton solution scheme become
nearly singular, which causes large errors during
their inversion.

5.  The  round-off  errors  of  the  calculations  (i.e.  the 
precision  of  the  computer)  start  having 
deteriorating  effects. 
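The second problem in the list above can be made concrete with a small deterministic example. The sketch below is an illustration under assumed values, not the paper's computation: two perpendicular vectors of equal length define an axis through their cross product, and the same absolute perturbation (0.01) is added to one of them. The shorter the vectors, the more the resulting axis tilts.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle_deg(a, b):
    """Angle between two 3-vectors, in degrees."""
    c = cross(a, b)
    sin_part = math.sqrt(sum(x * x for x in c))
    cos_part = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.atan2(sin_part, cos_part))

def axis_error_deg(scale, noise=0.01):
    """Tilt of the axis u x v when a fixed perturbation is added to u."""
    u = (scale, 0.0, 0.0)
    v = (0.0, scale, 0.0)
    u_noisy = (scale, 0.0, noise)
    return angle_deg(cross(u, v), cross(u_noisy, v))

for scale in (1.0, 0.2, 0.05):
    print(f"vector length {scale:4.2f} -> axis error {axis_error_deg(scale):6.2f} deg")
```

For this configuration the tilt equals atan(noise / length), so a perturbation that is negligible for large motions tilts the axis by more than 10 degrees once the vectors shrink to 0.05.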

From this table (3-2), we may also see that
observing an extra frame (the 6th frame) does improve
the results for case #1, but not for cases #2 and #3.
This shows that observing more points or more frames
does not necessarily improve the results. If the additional
data introduces more noise than useful information, then
we expect that the results will become more erroneous.
As Figure 3-4 shows, observing the 6th frame introduces a
disparity vector which is even smaller than the previous
disparities. This introduces more noise in the input data,
and hence, increases the error of the calculated
parameters.

It must be noted that, according to Table 3-2,
the program has accurately assessed its performance. For
example, it gave small error measures for test cases #1
and #2 to show the accuracy of the results. However, it
gave high error measures along with high confidence
numbers for test case #3. These show that the program
is very "confident" that the results are erroneous.

Finally,  let  us  note  that  because  of  the  proximity  of 
the  generated  initial  guesses  to  the  solution,  very  few 
iterations  were  needed  for  convergence  (in  cases  #1  and 
#2). 

3.2.  Real  Test  Data 

In this section two sequences of real test images are
treated. In these two cases, because the exact motion of
the object is unknown, we can only intuitively determine
if the answers are correct. We can also compare the
results for the different points on the object to determine
if the results are reasonably consistent. To determine
point correspondences, we used the methods described
in [18, 17] to segment the image into regions and then
match the regions on the basis of their shape, size and
position. The segmentation program was run on each
frame separately, without any guidance from the previous
frames, and the matching was done as a series of pairwise
matches. The feature points used for the motion
computations are the centers of the regions in each view.
These are the points plotted on the images in the
display of the results. The times are for the
implementation on a Symbolics 3645 with garbage
collection on (which improves the performance
substantially) and include some overhead operations, but
not the display of the results.

3.2.1.  Turning  Car  Sequence 

Figure 3-6 shows 6 frames of a turning car
sequence. This sequence was taken by a motor driven
camera at a rate of about 3 frames per second. The car
was  being  driven  at  a  constant  speed  while  turning  in  a 
circle.  The  segmentation  produced  many  regions  in  all 
the  images,  but  most  of  the  regions  represented 
stationary  objects.  The  motion  classifier  determined  that 
most  of  the  sequences  of  five  or  six  matching  regions  (a 
region  that  can  be  traced  through  five  or  six  views  of  the 


                 Test Case #1       Test Case #2        Test Case #3
Motion           Large Tr & Rot     Typical Tr & Rot    Small Tr & Rot
T                (1, 2, 3)          (2, 2, 2)           (0.01, 0.02, 0.03)
CRot             (0, -1, 2)         (1, 1, 3)           (5, 5, 10)
γ                30°                45°                 60°
χ                30°                45°                 60°
θ                45°                20°                 5°
RRot             1                  3                   1
Rel. Noise       3.1%               0.7%                14.8%
D_avg (pixels)   65.7               181.96              4.94

Table 3-1: General Motion Test Cases


                     Test Case #1        Test Case #2        Test Case #3
T (Length)           3.25% (2.49%)       3.71% (1.09%)       225.2% (434.2%)
T (Direction)        0.91° (0.63°)       9.87° (10.8°)       121.0° (20.91°)
ARot (Direction)     11.35° (8.35°)      52.64° (53.1°)      42.27° (118.1°)
θ                    0.96° (0.56°)       5.47° (8.28°)       81.47° (114.1°)
CRot                 0.62% (0.44%)       17.94% (21.76%)     9.26% (9.14%)
RRot                 4.5% (1.46%)        43.67% (54.66%)     97.89% (98.51%)
Iterations           4 (3)               3 (2)               430 (200)
Error Measure        10^-10 - 0.2        10^-8 - 2.25        60 - 124
Confidence Measure   71 - 94             71 - 93             70 - 95

Table 3-2: Error Performance Results of General Motion Cases for 5 Frames
(and for 6 frames)


scene)  were  pure  or  dominant  rotations.  Many  of  the 
pure  rotation  sequences  are  given  in  the  companion  paper 
[16].  In  this  section  we  will  present  the  results  for  two 
sets  of  points  (see  Figure  3-7  where  the  results  are 
displayed  on  the  fifth  frame  of  the  six  frame  sequence). 
The sequence labeled "1" corresponds to the passenger
window that was matched in frames 2 through 6, thus
the last point (with the pointer) is the location of the
window in the frame after the displayed image. Sequence
"2" is the driver's side tail light for frames 1 through 5.
The  results  are  summarized  in  Table  3-3. 


Figure 3-2: Convergence of Top 3 Guesses to
a Single Solution


Figure 3-3: Calculated 3-D Motion for Case #1
(panels: 5 Frames Observed, 6 Frames Observed)


Figure 3-4: Calculated 3-D Motion for Case #2
(panels: 5 Frames Observed, 6 Frames Observed)


              Feature #1                 Feature #2
T             (-0.23, -0.60, -10.73)     (-0.33, -0.31, 4.71)
ARot          (-0.42, -0.88, 0.30)       (0.03, -0.96, 0.28)
θ             69.05°                     23.12°
CRot          (-5.68, 3.46, 50.01)       (-0.60, 0.46, 49.98)
RRot          0.34                       7.08
EM/CM         < .007-13 / 70-100         11-23 / 56-100
Time (Sec)    82                         5307

Table 3-3: Results for Turning Car Sequence - (General Motion)


The first feature almost fits the translation model
(the points appear almost along a straight line); the small
radius of rotation shown in Table 3-3 agrees with this.
The second feature is very close to a pure rotation, but
the general motion computation yields a better fit to the
data than the rotation computation. The times include
approximately 60 seconds for the calculation of the
possible initial guesses and the computation of the
residuals for those guesses. The large time for Feature
#2 is because many of the initial guesses converged to an
invalid set of coefficients (C2 and C4 both equal to 0);
these values are rejected and the next set of guesses is
tried until a valid set of coefficients results.


Figure 3-7: Comparison of Input Data (hollow circles) to
Calculated Motion (solid curve) for Turning Car

3.3.  Crossing  Car  Sequence 

Figure  3-8  shows  6  frames  of  the  "Crossing  Car" 
sequence.  These  images  were  digitized  from  a  video  tape, 
with  one  image  for  each  field  of  the  television  signal  (60 
frames  per  second).  The  first  frame  for  the  matched  data 
is  frame  number  13  with  frame  number  17  (the  fifth  in 
the  subsequence)  used  to  display  the  results.  The  basic 
motion  is  a  translation,  but  because  there  is  some  motion 
of  the  camera  the  apparent  motion  of  the  truck  is  not 
exactly  a  translation.  There  are  two  sets  of  results,  the 
first  is  for  5  frames  (Figure  3-9)  and  the  second  is  for  six 
frames  (Figure  3-10).  The  results  displayed  on  frame 
number  17  have  open  circles  for  the  centers  of  the  regions 
which  were  matched  in  the  input  sequence.  Tables  3-4 
and  3-5  summarize  the  motion  parameters  for  these  two 
sets  of  results. 

Feature point 1 (Figure 3-9 and Table 3-4) and
Feature points 1 and 2 (Figure 3-10 and Table 3-5) are
points on the body of the truck. The first few feature
point locations are closer together because the entire
truck was not visible in these frames. Feature point 2 (5
frame case) is the front wheel. Feature point 3 (5
frames) and Feature points 3 and 4 (6 frames) are the
rear wheel of the truck. The motions for the wheels
(both the front and rear, for both the five and six frame
cases) are consistent with each other; the translations and
rotations are very similar. The primary motion is from
the translation, with the rotation accounting for the
slight up and down motion caused by the camera. Even
though the amount of rotation, θ, is large, the radius of
rotation is small. Hence the translation component
contains most of the motion information.
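A rough geometric check of this remark: a point that turns through an angle θ on a circle of radius R is displaced by the chord 2R sin(θ/2), which can never exceed 2R. The sketch below applies this to the front wheel entry of Table 3-4 (Feature #2). It assumes that T, θ and RRot in the table describe the same time interval and share the same scaled units, which the text does not state explicitly; the function names are illustrative.

```python
import math

def rotation_chord(radius, theta_deg):
    """Straight-line displacement of a point that turns through theta_deg
    on a circle of the given radius."""
    return 2.0 * radius * math.sin(math.radians(theta_deg) / 2.0)

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))

# Feature #2 of Table 3-4 (front wheel, 5 frames).
T = (2.14, -0.01, 0.03)
R_ROT = 0.26
THETA_DEG = 168.6

chord = rotation_chord(R_ROT, THETA_DEG)
print(f"rotation chord = {chord:.2f}, |T| = {norm(T):.2f}")
```

Under these assumptions the rotation moves the point by about 0.52 units against a translation of about 2.14, so the translation does account for most of the displacement even though θ is large.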

The motion for the cab of the truck differs because
the location used for the region is its center and the first
few views do not include the entire truck body. For this
case, the motion computed from six views was much
closer to the motion computed for the wheels. Since the
Z-component of the translation, Tz, is calculated from
the differences of the spacing of the points on the image
plane, small perturbations of these points can translate
into large values for Tz. This causes the inconsistencies
for the Z-component in both sets of data (5 frames and 6
frames), while the X- and Y-components are consistent.

The highest error measures (EM) and the lowest
confidence measures (CM) are for the computation of the
Axis of Rotation (ARot). This should be expected since
ARot is calculated from the cross product of two vectors
and hence is sensitive to noise in each of those vectors.
The  times  include  approximately  60  seconds  for  the 
computation  of  the  initial  guesses  for  the  5  frame  case 
and  300-400  seconds  for  the  6  frame  case.  The  extreme 
time  for  Feature  #1  (Table  3-4)  is  caused  by  the 
rejection  of  coefficients  that  are  almost  zero  and  the  need 
to  try  more  of  the  possible  guesses  until  a  valid  set  of 
final  coefficients  is  obtained. 

4.  Conclusions 

We  have  presented  the  results  for  the  computation 
of  three-dimensional  motion  parameters  by  using  one 
point  in  five  frames.  This  method  was  tested  on  a  set  of 
synthetic  data  to  prove  the  concept  and  on  a  collection  of 
real  data  to  illustrate  that  it  can  work  well  with  errors 
from  realistic  inputs. 

Contributions  of  this  work  include: 

•  Development  of  an  efficient  method  using  the  time 
flow  information  from  three  or  more  frames  in  a 
sequence. 

•  Development  of  a  scheme  for  automatically 
generating  initial  guesses  for  the  solution.  The 
implementation  of  the  three  point  in  three  frame 
case  should  result  in  an  even  better  performance 
with  real  data. 

•  Generation of natural parameters of motion, which
simplifies comparison of results from one frame to
the next, and accurately predicts the position of the
object in subsequent frames.

•  Development of a robust software system which has
been run on a number of different motion
sequences.

•  Use of data from existing image matching methods
for motion analysis.

This work is not a complete motion analysis system,
but is an important part of the final system. There
remains much work in combining the motion estimation
and prediction capabilities with feature based matching
systems to expand the capabilities of both the matching
system and the motion estimation system.


Figure 3-8: Six Frames of the Crossing Car Sequence
(a) 1st, 2nd, and 3rd Frames  (b) 4th, 5th, and 6th Frames


Figure 3-9: Comparison of Input Data (hollow circles)
to Calculated Motion (solid curve) (5 Frames)
for Crossing Car


Figure 3-10: Comparison of Input Data (hollow circles)
to Calculated Motion (solid curve) (6 Frames)
for Crossing Car


              Feature #1                Feature #2                Feature #3
T             (0.83, -0.15, 3.71)       (2.14, -0.01, 0.03)       (2.0, -0.03, 0.45)
ARot          (0.02, -0.94, -0.34)      (0.17, -0.98, 0.09)       (0.01, 0.84, 0.54)
θ             60.05°                    168.6°                    156.4°
CRot          (-4.70, -1.30, 50.19)     (3.67, -4.89, 50.26)      (-12.1, -4.62, 49.91)
RRot          0.85                      0.26                      0.11
EM/CM         1.6-44 / 46-61            10^-10 / 71-87            10^-9 / 71-87
Time (Sec)    9856                      193                       87

Table 3-4: Results for Crossing Car Sequence - (5 Frames)


              Feature #1                Feature #2
T             (1.92, 0.17, -5.63)       (1.92, 0.09, -4.27)
ARot          (-0.05, 0.99, -0.15)      (-0.01, 1.0, -0.08)
θ             114.3°                    87.37°
CRot          (-5.53, -1.25, 50.09)     (-3.94, -1.19, 49.76)
RRot          0.16                      0.27
EM/CM         .04-30 / 75-92            1.6-32 / 73-90
Time (Sec)    356                       317

              Feature #3                Feature #4
T             (1.93, -0.03, 0.67)       (1.95, -0.02, 0.57)
ARot          (-0.09, 0.73, 0.68)       (-0.11, -0.96, -0.25)
θ             137.0°                    171.7°
CRot          (-12.1, -4.61, 49.95)     (-10.12, -4.63, 50.36)
RRot          0.08                      0.40
EM/CM         .09-39 / 51-91            1.3-35 / 55-93
Time (Sec)    822                       475

Table 3-5: Results for Crossing Car Sequence - (6 Frames)


References 

[1]  H. Shariat and K. Price, “General Motion
Estimation with More Than Three Frames,” IEEE
Trans. on Pattern Analysis and Machine
Intelligence, Vol. XX, No. YY, Submitted 1986,
pp. XX-YY.

[2]  D. H. Ballard and O. A. Kimball, “Rigid Body
Motion from Depth and Optical Flow,” Computer
Vision, Graphics and Image Processing, Vol. 22,
No. 1, April 1983, pp. 95-115.

[3]  T. J. Broida and R. Chellappa, “Kinematics of a
Rigid Object from a Sequence of Noisy Images: A
Batch Approach,” Conference on Computer
Vision and Pattern Recognition, IEEE, Miami
Beach, FL, June 1986, pp. 176-182.

[4]  R. C. Bolles and H. H. Baker, “Epipolar-Plane
Image Analysis: A Technique for Analyzing Motion
Sequences,” IEEE Proc. of the 3rd Workshop on
Computer Vision: Representation and Control,
Bellaire, MI, October 1985, pp. 168-178.

[5]  J. A. Webb and J. K. Aggarwal, “Structure from
Motion of Rigid and Jointed Objects,” Artificial
Intelligence, Vol. 19, No. 1, September 1982, pp.
107-130.

[6]  D. T. Lawton, “Constraint-Based Inference from
Image Motion,” Proc. of the 1st Annual National
Conference on Artificial Intelligence, Stanford
Univ., August 1980.

[7]  D. D. Hoffman and B. E. Flinchbaugh, “The
Interpretation of Biological Motion,” Biological
Cybernetics, Vol. 42, No. 3, 1982, pp. 195-204.

[8]  B. L. Yen and T. S. Huang, “Determining 3-D
Motion and Structure of a Rigid Body Using the
Spherical Projection,” Computer Graphics and
Image Processing, Vol. 21, 1983, pp. 21-32.

[9]  H. Shariat and K. E. Price, “How to Use More
Than Two Frames to Estimate Motion,” Proc. of
IEEE Workshop on Motion: Representation and
Analysis, Charleston, SC, May 1986.

[10]  A. Mitiche, S. Seida, and J. Aggarwal, “Line-Based
Computation of Structure and Motion Using
Angular Invariance,” Workshop on Motion:
Representation and Analysis, IEEE, Kiawah
Island, SC, May 1986, pp. 175-180.

[11]  Y. Liu and T. S. Huang, “Estimation of Rigid
Body Motion Using Straight Line
Correspondences,” Workshop on Motion:
Representation and Analysis, IEEE, Kiawah
Island, SC, May 1986, pp. 47-52.


[12] .  C.-H.  Lee  and  A.  Rosenfeld,  “Structure  and 

Motion  of  a  Rigid  Object  Having  Unknown 
Constant  Motion,”  Workshop  on  Motion: 
Representation  and  Analysis,  IEEE,  Kiawah 
Island,  SC,  May  1986,  pp.  145-150. 

[13] .  A.  Mitiche,  S.  Seida,  and  J.  K.  Aggarwal, 

“Determining  Position  and  Displacement  in  Space 
from  Images,”  Proc.  of  IEEE  Conference  on 
Computer  Vision  and  Pattern  Recognition,  San 
Francisco,  CA,  June  1985,  pp.  504-509. 

[14] .  X.  Zhuang  and  R.  Haralick,  “Two  View  Motion 

Analysis,”  Proc.  of  IEEE  Conference  on 
Computer  Vision  and  Pattern  Recognition,  San 
Francisco,  CA,  June  1985,  pp.  686-690. 

[15] .  R.  Y.  Tsai  and  T.  S.  Huang,  “Uniqueness  and 

Estimation  of  Three-Dimensional  Motion 
Parameters  of  Rigid  Objects  with  Curved 
Surfaces,”  IEEE  Trans,  on  Pattern  Analysis  and 
Machine  Intelligence,  Vol.  6,  January  1984,  pp. 
13-27. 

[16] .  H.  Shariat  and  K.  Price,  “Motion  Estimation 

Using  Multiple  Frames  Simple  Cases:  Rotation  and 
Translation,”  IEEE  Trans,  on  Pattern  Analysis 
and  Machine  Intelligence,  Vol.  XX,  No.  YY, 
Submitted  1986,  pp.  XX- YY. 

[17] ,  R.  Ohlander,  K.  Price,  and  D.  R.  Reddy,  “Picture 

Segmentation  Using  A  Recursive  Region  Splitting 
Method,”  Computer  Graphics  and  Image 
Processing,  Vol.  8,  1978.  pp.  313-333. 

[18] .  O.  D.  Faugeras  and  K.  E.  Price,  “Semantic 

Description  of  Aerial  Images  Using  Stochastic 
Labeling,”  IEEE  Trans,  on  Pattern  Analysis  and 
Machine  Intelligence,  Vol.  3,  No.  6,  November 
1981,  pp.  633-642. 


TRACING  FINITE  MOTIONS  WITHOUT  CORRESPONDENCE 


Ken-ichi Kanatani*
Tsai-Chia  Chou 


Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD  20742 


ABSTRACT 

The 3D motion of an object having planar faces is traced,
starting from a known position, from a sequence of 2D perspective
projection images without using any knowledge of
point-to-point correspondence. Computation is based on the
shape of the region on the image plane corresponding to a
planar face of the object. Given two images, a heuristic guess
about the motion is first computed, and one image is
transformed according to this estimated motion so that it is
positioned close to the other image. Then, the motion that
accounts for the remaining small discrepancy is estimated by
measuring numerical features of the planar regions. The
scheme is based on the optical flow due to infinitesimal
motion, and estimation is done by solving a set of simultaneous
linear equations. This process is iterated; after each estimate
of the motion, one image is transformed according to the
estimated motion so that it is positioned closer and closer to
the target image. Various practical issues such as choice of
features, constrained motions, face identification, and computation
of features without actually transforming the images are
discussed. Some numerical examples are also given.

1.  INTRODUCTION 

Detection  of  the  3D  structure  of  a  moving  object  from  a 
sequence  of  2D  images  is  one  of  the  most  important  problems 
in computer vision and is often referred to simply as “structure
from motion”. The problem takes considerably different
forms  according  to  the  magnitude  of  motion  that  takes  place 
between  two  successive  frames,  i.e.,  “finite  motion”  or 
“infinitesimal  motion”. 

Given a point-to-point correspondence between two successive
frames, vectors are obtained on the image plane indicating
where each point moves from one frame to the next.
Let  us  call  these  vectors  the  displacement  field.  If  the  motion 
is  infinitesimally  small,  the  displacement  field  is  identified  with 
the  velocities  of  image  points  (the  time  interval  between  the 
two  frames  is  regarded  as  unit  time)  and  is  often  termed  the 
optical  flow. 

Analysis of the displacement field is done by solving a set
of simultaneous equations that require rigidity of the object,
i.e., constancy of length and angle. Unfortunately, these
equations are usually non-linear equations of a complicated form,
so that their solution is given only by numerical schemes [23,
25, 29, 30]. The analysis becomes simpler to some extent for
optical flows resulting from infinitesimal motion [9, 31], and
exact analytical solutions have been obtained in closed forms
when the object is a planar surface [18, 19, 24, 28].

*Present address: Department of Computer Science, Gunma University, Kiryu, Gunma 376, Japan.

The most serious drawback of this type of analysis,
whether for the displacement field or for the optical flow, lies
in the difficulty of finding the exact point-to-point correspondence.
Many techniques of detecting the displacement field or
optical flow have been proposed [1, 6-9, 11, 12, 21, 26, 30],
but they consume considerable computation time. Correspondence
cannot be obtained in regions where occlusion takes
place. Even if no occlusion is involved, existence of noise or
misdetection of correspondence is inevitable. All the above
mentioned analyses are known to be very vulnerable to noise
(cf. Adiv [2], Tsai and Huang [29]). Therefore, schemes of 3D
recovery which do not require correspondence are desired.

There exist methods of computing the optical flow that
do not require detection of feature points or point-to-point
correspondence when planar regions are identified. Waxman
and Wohn [32] observed time varying contour images. Kanatani
[13-17] computed a set of numerical features (which are
also called properties in Rosenfeld and Kak [26]), measuring
characteristics of the shape, such as area, contour length,
centroid and moments for a planar region in each frame. However,
since these methods are based on optical flow, they can
be applied only to infinitesimal motions.

After  many  attempts,  researchers  today  seem  to  have 
agreed  that  simultaneous  detection  of  both  structure  and 
motion  from  an  image  sequence  based  on  point-to-point 
correspondence  is  extremely  difficult  and  hence  impractical. 
Now,  attempts  are  being  made  to  determine  the  structure  and 
motion  separately.  For  example,  Lin  et  al.  [22]  proposed  to 
use  stereo  to  determine  the  position  of  the  object  in  each 
frame  and  then  compute  the  motions  that  have  taken  place 
from  frame  to  frame.  Computation  is  based  on  several  feature 
points  taken  from  the  object  image.  According  to  their 
scheme,  detection  of  point-to-point  correspondence  is  necessary 
for  stereo  matching,  but  no  correspondence  is  required  between 
image  frames  taken  at  different  times.  Aloimonos  and  Basu  [3] 
also  proposed  to  use  two  cameras  and  feature  points  taken 
from  the  object  image.  They  showed  that  even  the  point-to- 
point  correspondence  for  stereo  matching  is  not  necessary  if 
the  object  is  a  planar  surface. 

Another  issue  is  the  ambiguity  of  the  interpretation; 
there  exist  multiple  solutions  which  correspond  to  the  same 
displacement  field  or  optical  flow.  One  way  to  avoid  this 
indeterminacy  is  to  trace  the  motion  starting  from  a  known 
initial  position  detected  by  some  other  means  -  range  sensor  or 
stereo, for example. This idea combined with the correspondenceless
approach was also proposed by Kanatani [13-17] for
a sequence of infinitesimally small motions.

In  this  paper,  we  propose  a  new  scheme  to  trace  the 
motion  of  an  object  from  a  known  initial  position  without 
using  point-to-point  correspondence.  Furthermore,  the 
motions  need  not  be  infinitesimally  small.  The  time  interval 
between  successive  frames  can  be  arbitrarily  long  as  long  as 
the  object  of  interest  is  always  seen.  Here,  we  assume  that  a 
planar  face  of  the  object  is  identified.  The  motion  is  computed 
by  measuring  numerical  “features”  characterizing  the  shape  of 
the  planar  region. 

This  type  of  approach  was  first  tried  by  Cyganski  and 
Orr  [10]  under  restricted  situations.  They  assumed  that  the 
projection  is  orthographic  and  that  the  initial  frame  gives  the 
image  of  a  planar  region  parallel  to  the  image  plane.  They 
computed  various  moments  of  the  planar  region  before  and 
after  the  motion  and  reconstructed  the  “affine”  transformation 
resulting  from  the  motion,  so  that  information  about  depth 
could not be obtained. In this paper, on the other hand, we
assume  perspective  projection  and  the  position  of  the  object  is 
arbitrary  in  the  first  frame. 

First,  given  two  images  of  successive  frames,  a  heuristic 
“guess”  about  the  motion  is  computed,  and  one  image  is 
transformed  according  to  this  estimated  motion  so  that  it  is 
positioned  close  to  the  other  image.  Then,  the  motion  that 
accounts  for  the  remaining  small  discrepancy  is  estimated  by 
measuring  numerical  “features”  characterizing  the  shape  of  the 
planar region. Estimation is done by solving a set of simultaneous
linear equations based on the equation of optical flow
developed by Kanatani [16, 17]. This process is iterated; after
each estimate of the motion, one image is transformed according
to the estimated motion so that it is positioned closer and
closer  to  the  target  image. 

The  application  of  our  procedure  is  not  limited  to  motion 
detection.  Evidently,  the  same  procedure  can  be  used  to 
determine  the  orientation  and  position  of  a  stationary  object 
in  the  scene  when  the  shape  of  the  object  is  known  and  stored 
in  a  data  base.  Since  the  shape  is  known,  we  can  generate,  say 
by  a  computer  graphics  technique,  the  projected  image  of  the 
object  when  placed  in  an  appropriate  position  and  orientation. 
We  can  locate  the  object  by  computing  the  “motion”  between 
this  reference  image  and  the  observed  image. 

Various practical issues which become relevant in implementation
are discussed, including the choice of features, constrained
motions and face identification. We also discuss a
method  of  computing  features  without  actually  transforming 
images  by  introducing  the  “conjugate  feature  transformation”. 
This  method  makes  it  possible  to  do  all  the  computations  on 
the  initially  given  images  alone. 

We  give  some  numerical  examples.  We  also  test  noise 
sensitivity.  Our  scheme  turns  out  to  work  well  for  a  very 
large  motion,  and  the  computation  is  very  robust  to  noise 
added  to  the  images  and  input  data. 

2. DISPLACEMENT FIELD AND OPTICAL FLOW

2.1 Geometry of Perspective Projection

Consider an $xyz$-coordinate system fixed in the scene, and
regard the $xy$-plane as the image plane. We regard point
$(0, 0, -f)$ on the negative side of the $z$-axis at distance $f$ from
the $xy$-plane as the viewpoint or camera focus, and call $f$ the
focal length. A point in the scene is projected to the intersection
of the $xy$-plane with the ray connecting the point and the
viewpoint (Fig. 1). Hence, a point $(X, Y, Z)$ is projected to
point $(x, y)$ on the image plane, where

\[
x = \frac{fX}{f+Z}, \qquad y = \frac{fY}{f+Z}. \tag{2.1}
\]

This perspective projection reduces to orthographic projection
in the limit of $f \to \infty$.

Consider a planar surface $S$ in the scene, and let
$Z = pX + qY + r$ be its equation. We call $p$, $q$, $r$ the surface
parameters of the plane. The 3D position of a point is
recovered from its image if the point is known to be on the
surface. Namely, point $(x, y)$ is back-projected to point
$(X, Y, Z)$, where

\[
X = \frac{(f+r)x}{f-px-qy}, \qquad
Y = \frac{(f+r)y}{f-px-qy}, \qquad
Z = \frac{f(px+qy+r)}{f-px-qy}. \tag{2.2}
\]
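As a quick numerical check of eqns (2.1) and (2.2), the following minimal sketch back-projects an image point onto a plane and projects it again. The function names and the numerical values of $f$, $p$, $q$, $r$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project(P, f):
    """Perspective projection, eqns (2.1): scene point (X, Y, Z) -> image point (x, y)."""
    X, Y, Z = P
    return np.array([f * X / (f + Z), f * Y / (f + Z)])

def back_project(xy, f, p, q, r):
    """Back-projection, eqns (2.2): image point -> point on the plane Z = pX + qY + r."""
    x, y = xy
    d = f - p * x - q * y
    return np.array([(f + r) * x, (f + r) * y, f * (p * x + q * y + r)]) / d

# Illustrative values (assumptions for this sketch only).
f, p, q, r = 1.0, 0.1, -0.2, 5.0
P = back_project((0.3, -0.2), f, p, q, r)
print("scene point:", P)
print("re-projected image point:", project(P, f))
```

The re-projected point should coincide with the starting image point, and the scene point should satisfy the plane equation.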

2.2  Finite  Motion  and  Displacement  Field 

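Consider another planar surface $S'$ in the scene. Suppose
we observe surface $S$ at time $t$ and surface $S'$ at time $t'$ and
we are required to determine the motion in between. In order
to specify the rigid motion, choose an arbitrary reference point
$(X_0, Y_0, Z_0)$ on the surface $S$. For example, if the surface $S$ is
bounded, we may choose an arbitrary point $(x_0, y_0)$ inside the
bounded region and back-project it onto the surface $S$ by eqns
(2.2). Then, the motion is specified by the translation $(a, b, c)$
of the reference point and the rotation $R = (r_{ij})$ around it (Fig.
2). We call $r_{ij}$, $i, j = 1, 2, 3$, and $a$, $b$, $c$ the motion parameters.
This motion maps point $(X, Y, Z)$ on the surface to $(X', Y', Z')$,
where

\[
\begin{pmatrix} X' \\ Y' \\ Z' \end{pmatrix}
= \begin{pmatrix} X_0 \\ Y_0 \\ Z_0 \end{pmatrix}
+ \begin{pmatrix} a \\ b \\ c \end{pmatrix}
+ R \begin{pmatrix} X - X_0 \\ Y - Y_0 \\ Z - Z_0 \end{pmatrix}. \tag{2.3}
\]

It is not difficult to prove

Proposition 1. Surface $Z = pX + qY + r$ is mapped to surface
$Z = p'X + q'Y + r'$ by the motion (2.3), where

\[
p' = -\frac{r_{11}p + r_{12}q - r_{13}}{r_{31}p + r_{32}q - r_{33}}, \qquad
q' = -\frac{r_{21}p + r_{22}q - r_{23}}{r_{31}p + r_{32}q - r_{33}},
\]
\[
r' = (Z_0 + c) - (X_0 + a)p' - (Y_0 + b)q'. \tag{2.4}
\]

If we substitute eqns (2.2) in the right-hand side of eqn
(2.3) and put $x' = fX'/(f+Z')$ and $y' = fY'/(f+Z')$, we obtain

Proposition 2 (Displacement Field). Point $(x, y)$ on the image
plane is mapped to point $(x', y')$ by the motion (2.3) according
to

\[
x' = f\,\frac{A_{11}x + A_{12}y + A_{13}f}{A_{31}x + A_{32}y + A_{33}f}, \qquad
y' = f\,\frac{A_{21}x + A_{22}y + A_{23}f}{A_{31}x + A_{32}y + A_{33}f}, \tag{2.5}
\]

where

\[
\begin{aligned}
A_{11} &= -\frac{1}{f+r}\bigl(p(X_0+a) - (pX_0+f+r)r_{11} - pY_0 r_{12} - p(Z_0+f)r_{13}\bigr), \\
A_{12} &= -\frac{1}{f+r}\bigl(q(X_0+a) - qX_0 r_{11} - (qY_0+f+r)r_{12} - q(Z_0+f)r_{13}\bigr), \\
A_{13} &= \frac{1}{f+r}\bigl(X_0 + a - r_{11}X_0 - r_{12}Y_0 - r_{13}(Z_0 - r)\bigr), \\
A_{21} &= -\frac{1}{f+r}\bigl(p(Y_0+b) - (pX_0+f+r)r_{21} - pY_0 r_{22} - p(Z_0+f)r_{23}\bigr), \\
A_{22} &= -\frac{1}{f+r}\bigl(q(Y_0+b) - qX_0 r_{21} - (qY_0+f+r)r_{22} - q(Z_0+f)r_{23}\bigr), \\
A_{23} &= \frac{1}{f+r}\bigl(Y_0 + b - r_{21}X_0 - r_{22}Y_0 - r_{23}(Z_0 - r)\bigr), \\
A_{31} &= -\frac{1}{f+r}\bigl(p(Z_0+c+f) - (pX_0+f+r)r_{31} - pY_0 r_{32} - p(Z_0+f)r_{33}\bigr), \\
A_{32} &= -\frac{1}{f+r}\bigl(q(Z_0+c+f) - qX_0 r_{31} - (qY_0+f+r)r_{32} - q(Z_0+f)r_{33}\bigr), \\
A_{33} &= \frac{1}{f+r}\bigl(Z_0 + c + f - r_{31}X_0 - r_{32}Y_0 - r_{33}(Z_0 - r)\bigr).
\end{aligned} \tag{2.6}
\]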

Transformation (2.5) is regarded as a 2D projective
transformation, and is expressed as a linear transformation if
homogeneous coordinates are used, though the form of eqns
(2.5) is more convenient for our purpose. One important
consequence is the fact that the matrix $A = (A_{ij})$ induces a
homomorphism from the transformation group of the form of
eqns (2.5) into the group of matrices under multiplication.
Namely, if the image undergoes a transformation specified by
matrix $A$ followed by another transformation specified by
matrix $A'$, the composite transformation is specified by matrix
$A'' = A'A$. Thus, the inverse transformation of eqns (2.5) is
given by replacing matrix $A$ by its inverse $A^{-1}$. Let us call
parameters $A_{ij}$, $i, j = 1, 2, 3$, the transformation parameters.
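The displacement field (2.5) and its group property can be checked numerically. The sketch below builds the matrix $A$ in a compact form, $A = R + (P_c + t - RP_c)\,n^T/(f+r)$ with $P_c = (X_0, Y_0, Z_0+f)$, $t = (a, b, c)$ and $n = (-p, -q, 1)$. This compact form is our own rearrangement of eqns (2.6), not a formula stated in the paper, and all function names and numerical values are illustrative assumptions. The sketch compares the image-plane map with direct 3D motion followed by projection, then checks that one transformation followed by another corresponds to the product of the second matrix with the first.

```python
import numpy as np

def rot_x(w):
    c, s = np.cos(w), np.sin(w)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(w):
    c, s = np.cos(w), np.sin(w)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def transformation_matrix(f, p, q, r, P0, R, t):
    """Matrix A of eqns (2.5)-(2.6) in compact form (our rearrangement)."""
    Pc = np.asarray(P0, float) + np.array([0.0, 0.0, f])
    n = np.array([-p, -q, 1.0])
    return R + np.outer(Pc + t - R @ Pc, n) / (f + r)

def apply(A, xy, f):
    """Displacement field (2.5) acting on an image point (x, y)."""
    u = A @ np.array([xy[0], xy[1], f])
    return f * u[:2] / u[2]

def moved_plane(p, q, r, P0, R, t):
    """Surface parameters after the motion, eqns (2.4); P0 must lie on the plane."""
    n = R @ np.array([p, q, -1.0])
    p2, q2 = -n[0] / n[2], -n[1] / n[2]
    Q = np.asarray(P0, float) + t
    return p2, q2, Q[2] - Q[0] * p2 - Q[1] * q2

# Illustrative setup (assumed values).
f, p, q, r = 1.0, 0.1, -0.2, 5.0
P0 = np.array([0.0, 0.0, r])                 # back-projection of the image origin
R1, t1 = rot_z(0.3) @ rot_x(-0.2), np.array([0.4, -0.1, 0.8])
A1 = transformation_matrix(f, p, q, r, P0, R1, t1)

x, y = 0.3, -0.2
d = f - p * x - q * y
P = np.array([(f + r) * x, (f + r) * y, f * (p * x + q * y + r)]) / d  # eqns (2.2)
P1 = P0 + t1 + R1 @ (P - P0)                                          # eqn (2.3)
direct1 = f * P1[:2] / (f + P1[2])                                    # eqns (2.1)

# A second motion about the moved reference point, applied to the moved plane.
p2, q2, r2 = moved_plane(p, q, r, P0, R1, t1)
Q0, R2, t2 = P0 + t1, rot_x(0.2), np.array([-0.2, 0.3, 0.5])
A2 = transformation_matrix(f, p2, q2, r2, Q0, R2, t2)
P2 = Q0 + t2 + R2 @ (P1 - Q0)
direct2 = f * P2[:2] / (f + P2[2])

print(apply(A1, (x, y), f), direct1)             # image map vs. 3D motion
print(apply(A2 @ A1, (x, y), f), direct2)        # composition
print(apply(np.linalg.inv(A1), direct1, f))      # inverse recovers (x, y)
```

Because only the ratios of the entries of $A$ matter in (2.5), any nonzero scalar multiple of $A$ gives the same image map.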


2.3  Infinitesimal  Motion  and  Optical  Flow 


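As is well known, a 3D rotation is specified by the rotation
axis $(n_1, n_2, n_3)$, which is taken to be a unit vector, and
the rotation angle $\Omega$ (rad) screwwise around it. The
corresponding rotation matrix is given by

\[
R = \begin{pmatrix}
\cos\Omega + (1-\cos\Omega)n_1^2 & (1-\cos\Omega)n_1 n_2 - \sin\Omega\, n_3 & (1-\cos\Omega)n_1 n_3 + \sin\Omega\, n_2 \\
(1-\cos\Omega)n_2 n_1 + \sin\Omega\, n_3 & \cos\Omega + (1-\cos\Omega)n_2^2 & (1-\cos\Omega)n_2 n_3 - \sin\Omega\, n_1 \\
(1-\cos\Omega)n_3 n_1 - \sin\Omega\, n_2 & (1-\cos\Omega)n_3 n_2 + \sin\Omega\, n_1 & \cos\Omega + (1-\cos\Omega)n_3^2
\end{pmatrix}. \tag{2.7}
\]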

Consider as a special case an infinitesimal motion. If the
rotation angle $\Omega$ is close to zero, the rotation is close to the
identity. Then, matrix $R$ of eqn (2.7) takes the form
$R = I + \Delta R$, where $I$ is the unit matrix and

\[
\Delta R = \begin{pmatrix}
0 & -\Omega_3 & \Omega_2 \\
\Omega_3 & 0 & -\Omega_1 \\
-\Omega_2 & \Omega_1 & 0
\end{pmatrix} + \cdots. \tag{2.8}
\]

Here, we put $\Omega_1 = \Omega n_1$, $\Omega_2 = \Omega n_2$ and $\Omega_3 = \Omega n_3$, and $\cdots$
denotes higher order terms in $\Omega$. This infinitesimal rotation is
regarded as a rotation around an axis whose orientation is
$(\Omega_1, \Omega_2, \Omega_3)$ by angle $\sqrt{\Omega_1^2 + \Omega_2^2 + \Omega_3^2}$ (rad) screwwise. Thus, $a$,
$b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$ are the motion parameters of infinitesimal
motion. Then, Propositions 1 and 2 respectively become

Proposition 3. Surface $Z = pX + qY + r$ is mapped by an
infinitesimal motion to surface $Z = p'X + q'Y + r'$, where
$p' = p + \Delta p$, $q' = q + \Delta q$, $r' = r + \Delta r$ and

\[
\begin{aligned}
\Delta p &= pq\,\Omega_1 - (p^2+1)\Omega_2 - q\,\Omega_3 + \cdots, \\
\Delta q &= (q^2+1)\Omega_1 - pq\,\Omega_2 + p\,\Omega_3 + \cdots, \\
\Delta r &= -pa - qb + c - \bigl(Y_0 + q(Z_0 - r)\bigr)\Omega_1 + \bigl(X_0 + p(Z_0 - r)\bigr)\Omega_2 + (qX_0 - pY_0)\Omega_3 + \cdots.
\end{aligned} \tag{2.9}
\]

(Here, $\cdots$ denotes higher order terms in $a$, $b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$.)

Proposition 4 (Optical Flow). Point $(x, y)$ on the image plane
is mapped by an infinitesimal motion to point $(x', y')$ by
$x' = x + \Delta x$, $y' = y + \Delta y$ and

\[
\begin{aligned}
\Delta x &= u_0 + Ax + By + (Ex + Fy)x + \cdots, \\
\Delta y &= v_0 + Cx + Dy + (Ex + Fy)y + \cdots,
\end{aligned} \tag{2.10}
\]

where $\cdots$ denotes higher order terms in $a$, $b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$,
and

\[
\begin{aligned}
u_0 &= \frac{f}{f+r}\bigl(a - (Z_0 - r)\Omega_2 + Y_0\Omega_3\bigr), \qquad
v_0 = \frac{f}{f+r}\bigl(b + (Z_0 - r)\Omega_1 - X_0\Omega_3\bigr), \\
A &= -\frac{1}{f+r}\bigl(pa + c - Y_0\Omega_1 + (X_0 - p(Z_0+f))\Omega_2 + pY_0\Omega_3\bigr), \\
B &= -\frac{1}{f+r}\bigl(qa - q(Z_0+f)\Omega_2 + (qY_0+f+r)\Omega_3\bigr), \\
C &= -\frac{1}{f+r}\bigl(pb + p(Z_0+f)\Omega_1 - (pX_0+f+r)\Omega_3\bigr), \\
D &= -\frac{1}{f+r}\bigl(qb + c - (Y_0 - q(Z_0+f))\Omega_1 + X_0\Omega_2 - qX_0\Omega_3\bigr), \\
E &= \frac{1}{f(f+r)}\bigl(pc - pY_0\Omega_1 + (pX_0+f+r)\Omega_2\bigr), \\
F &= \frac{1}{f(f+r)}\bigl(qc - (qY_0+f+r)\Omega_1 + qX_0\Omega_2\bigr).
\end{aligned} \tag{2.11}
\]

The eight parameters $u_0$, $v_0$, $A$, $B$, $C$, $D$, $E$, $F$ are called the
flow parameters.

2.4 3D Recovery from Correspondence

If an exact correspondence is given for at least four points
on the image plane, we can obtain the transformation parameters
of eqns (2.5) or the flow parameters of eqns (2.10) by
fitting these equations to the observed correspondence pairs.
(For eqns (2.5), only the ratio $A_{11} : A_{12} : A_{13} : A_{21} : A_{22} : A_{23}
: A_{31} : A_{32} : A_{33}$ is determined.) Then, the surface parameters
$p$, $q$, $r$ and the motion parameters $a$, $b$, $c$, $r_{ij}$ (or $\Omega_1$, $\Omega_2$,
$\Omega_3$) are obtained by solving eqns (2.6) or (2.11). However, the
solution is in general not unique. For eqns (2.11), a complete
solution is available in analytical closed form (cf. Kanatani
[18]). If the surface parameters $p$, $q$, $r$ are assumed to be
known, eqns (2.6) or (2.11) are linear in the motion parameters
$a$, $b$, $c$, $r_{ij}$ (or $\Omega_1$, $\Omega_2$, $\Omega_3$), so that they are determined by
solving a set of linear equations. However, as was discussed in
the previous section, an exact correspondence is difficult to
find, and this type of analysis is known to be very sensitive to
noise and mismatch of the point-to-point correspondence.


3. MOTION ESTIMATION BY FEATURE
DETECTION

3.1  Features  of  Images 

Let $X(x, y)$ represent the image. For example, if the
image consists of gray levels, the value of $X(x, y)$ designates
the image intensity at point $(x, y)$. If the image consists of
colors, $X(x, y)$ may be a vector valued function, corresponding
to R, G, B. If the image consists of points and lines, function
$X(x, y)$ has delta-function-like singularities. In any case, we
define a feature of image $X(x, y)$ as a functional, i.e., a map
$F[X]$ from a set of (appropriately restricted) images $X(x, y)$ to
the set $\mathbb{R}$ of real numbers. (This is also called a property in
Rosenfeld and Kak [25].)

The basic principle of our correspondenceless approach is
as follows. We observe an appropriate set of features $F_i[X]$,
$i = 1, 2, 3, \ldots$, of image $X(x, y)$ in one frame and image
$X'(x, y)$ in the next frame and reconstruct (hopefully) the
motion of the object from the numerical values of $F_i[X]$ and
$F_i[X']$, $i = 1, 2, 3, \ldots$. Thus, there is no need to know the
point-to-point correspondence. It is, however, very difficult to
obtain simple equations to determine the motion in terms of
the feature values for a general object. Here, for simplicity, we
assume that planar parts of the object surface are identified on
the image plane and that region-to-region correspondence is
known. (This issue is discussed later.)

3.2  Infinitesimal  Motion 

Let us first consider infinitesimal motion. Let $F[X]$ be
the value of a feature of a planar region, i.e., the region
corresponding to a planar face of the object, in one frame and
$F[X']$ be the value of the same feature of the region in the
other frame. If the feature functional $F[\,\cdot\,]$ is smooth (with
respect to an appropriate topology), the value of $F[X']$ is also
infinitesimally different from $F[X]$, i.e., the difference

\[
\Delta F[X] = F[X'] - F[X] \tag{3.1}
\]

is infinitesimally small. The planar region under consideration
undergoes a small change according to the optical flow of eqns
(2.10). The optical flow is linear in the flow parameters $u_0$, $v_0$,
$A$, $B$, $C$, $D$, $E$, $F$ and is zero if these flow parameters are
zero. Hence, unless the feature functional $F[\,\cdot\,]$ is pathological,
the difference $\Delta F[X]$ is expanded into Taylor series

\[
\begin{aligned}
\Delta F[X] = {} & C_{u_0}[X]u_0 + C_{v_0}[X]v_0 + C_A[X]A + C_B[X]B + C_C[X]C \\
& + C_D[X]D + C_E[X]E + C_F[X]F + \cdots,
\end{aligned} \tag{3.2}
\]

where $\cdots$ denotes higher order terms in the flow parameters
$u_0$, $v_0$, $A$, $B$, $C$, $D$, $E$, $F$.

An important fact is that $C_{u_0}[\,\cdot\,]$, $C_{v_0}[\,\cdot\,]$, $C_A[\,\cdot\,]$, $C_B[\,\cdot\,]$,
$C_C[\,\cdot\,]$, $C_D[\,\cdot\,]$, $C_E[\,\cdot\,]$, $C_F[\,\cdot\,]$ are functionals that are derived
from the given functional $F[\,\cdot\,]$ independent of the image and
hence they are known functionals given beforehand.

If we substitute eqns (2.11) in eqn (3.2), we obtain

\[
\Delta F[X] = C_a[X]a + C_b[X]b + C_c[X]c + C_{\Omega_1}[X]\Omega_1 + C_{\Omega_2}[X]\Omega_2 + C_{\Omega_3}[X]\Omega_3 + \cdots, \tag{3.3}
\]

where

\[
\begin{aligned}
C_a[X] &= \frac{1}{f+r}\bigl(fC_{u_0}[X] - pC_A[X] - qC_B[X]\bigr), \\
C_b[X] &= \frac{1}{f+r}\bigl(fC_{v_0}[X] - pC_C[X] - qC_D[X]\bigr), \\
C_c[X] &= -\frac{1}{f+r}\Bigl(C_A[X] + C_D[X] - \frac{1}{f}\bigl(pC_E[X] + qC_F[X]\bigr)\Bigr), \\
C_{\Omega_1}[X] &= \frac{1}{f+r}\Bigl(f(Z_0 - r)C_{v_0}[X] + Y_0C_A[X] - p(f+Z_0)C_C[X] \\
&\qquad + \bigl(Y_0 - q(f+Z_0)\bigr)C_D[X] - \frac{1}{f}\bigl(pY_0C_E[X] + (qY_0+f+r)C_F[X]\bigr)\Bigr), \\
C_{\Omega_2}[X] &= -\frac{1}{f+r}\Bigl(f(Z_0 - r)C_{u_0}[X] + \bigl(X_0 - p(f+Z_0)\bigr)C_A[X] - q(f+Z_0)C_B[X] \\
&\qquad + X_0C_D[X] - \frac{1}{f}\bigl((pX_0+f+r)C_E[X] + qX_0C_F[X]\bigr)\Bigr), \\
C_{\Omega_3}[X] &= \frac{1}{f+r}\Bigl(f\bigl(Y_0C_{u_0}[X] - X_0C_{v_0}[X]\bigr) - pY_0C_A[X] \\
&\qquad - (qY_0+f+r)C_B[X] + (pX_0+f+r)C_C[X] + qX_0C_D[X]\Bigr).
\end{aligned} \tag{3.4}
\]

If the surface parameters $p$, $q$, $r$ are known, $C_a[\,\cdot\,]$, $C_b[\,\cdot\,]$,
$C_c[\,\cdot\,]$, $C_{\Omega_1}[\,\cdot\,]$, $C_{\Omega_2}[\,\cdot\,]$, $C_{\Omega_3}[\,\cdot\,]$ are known functionals, and
hence the values of $C_a[X]$, $C_b[X]$, $C_c[X]$, $C_{\Omega_1}[X]$, $C_{\Omega_2}[X]$,
$C_{\Omega_3}[X]$ are computed for image $X(x, y)$. Thus, if the difference
$\Delta F[X] = F[X'] - F[X]$ is observed from the two images
$X(x, y)$ and $X'(x, y)$, eqn (3.3) gives a linear constraint on the
motion parameters $a$, $b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$ (neglecting higher order
terms). This means that if at least six independent feature
functionals $F^{(i)}[X]$, $i = 1, 2, \ldots$, are provided, a set of linear
equations is available to determine the motion parameters $a$,
$b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$ in the form

\[
\begin{pmatrix}
C_a^{(1)}[X] & C_b^{(1)}[X] & C_c^{(1)}[X] & C_{\Omega_1}^{(1)}[X] & C_{\Omega_2}^{(1)}[X] & C_{\Omega_3}^{(1)}[X] \\
C_a^{(2)}[X] & C_b^{(2)}[X] & C_c^{(2)}[X] & C_{\Omega_1}^{(2)}[X] & C_{\Omega_2}^{(2)}[X] & C_{\Omega_3}^{(2)}[X] \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{pmatrix}
\begin{pmatrix} a \\ b \\ c \\ \Omega_1 \\ \Omega_2 \\ \Omega_3 \end{pmatrix}
=
\begin{pmatrix}
F^{(1)}[X'] - F^{(1)}[X] \\
F^{(2)}[X'] - F^{(2)}[X] \\
\vdots
\end{pmatrix}, \tag{3.5}
\]

or the least square method can be applied to more than six
equations.

Thus, given the surface parameters $p$, $q$, $r$ for the initial
frame, the motion parameters $a$, $b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$ of the
motion between the initial and the next frames are obtained
without using point-to-point correspondence. Then, the surface
parameters $p'$, $q'$, $r'$ in the next frame are approximately
determined by eqns (2.9). This process can be repeated, and
the motion of the surface is traced if we start from a known
initial position of the surface. This is a special case of the
principle proposed by Kanatani [15, 16]. He also studied in
detail various types of possible features.


3.3 Finite Motion

The method described above is based on optical flow and
hence is a first order approximation with respect to the motion
parameters. Therefore, errors accumulate unless the time
interval between successive frames is very short. On the other
hand, if the time interval is too short, the differences $\Delta F^{(i)}[X]$
are very small and are likely to be buried in numerical errors.
In the following, we consider a scheme applicable to finite
motions.

The  basic  principle  we  adopt  is  to  use  the  method 
described  above  to  obtain  an  initial  approximation  and  to 
transform  the  image  according  to  the  estimated  motion.  This 
process  is  repeated  until  the  two  images  sufficiently  overlap. 
Let  p,  y,  r  be  the  initially  known  surface  parameters,  and  let 
a,  b,  c,  Oj,  Q2,  fl3  be  the  estimated  motion  parameters.  Let 
R  be  the  rotation  matrix  computed  by  eqn  (2.7)  by  putting 
n=vni+n7+n7,  "i=n,/n,  n2=n2/n,  n3= n3/n.  if  the 

surface  undergoes  the  motion  of  eqn  (2.3),  the  new  surface 
parameters  p,  q,  f  are  given  by  eqns  (2.4).  The  image  is 
transformed  according  to  the  displacement  field  of  eqns  (2.5). 

Suppose  image  X(x,y)  is  transformed  by  the  displace- 


707 


ment  field  of  eqns  (2.5)  into  image  X(x,y).  Since  we  know  the 
surface  parameters  p ,  q ,  r ,  we  can  repeat  the  same  process  by 
measuring  the  features  of  image  X(x,y).  (There  exists  a 
method  to  compute  these  features  from  the  original  iamge 
X(x,y)  without  actually  doing  image  transformation;  cf. 
Appendix  A.)  At  this  stage,  we  compute  the  motion  with 
respect  to  the  new  reference  point  (X0,  Y0l Z0),  where 


X0 — ■Xo_b°>  Zq=Zq+c.  (3.8) 


Thus,  the  new  motion  parameters  a ,  6 ,  c ,  Oj,  02,  03  of  the 
next  step  are  given  by  solving 


C/»[M  Ci'>[X|  C^[X]  C$[X\  C$\X] 
C.(2'[X]  C/5>M  C.»[*|  <?#>(*]  C$\X\  C$[X\ 


b 

c 

n2 

n3 


p(i)[x'!-F(,)[xn 

F(2)MK(2)[*I 


(3.7) 


Now,  we  obtain  the  rotation  matrix  R  by  eqn  (2.7)  by 
putting  n^SV+fV+iV  an<^  substituting  fl=n, 
«!— n,/n,  n^Qt/n,  n3=fi3/0  in  it.  The  composition  of 
two  motions  specified  by  motion  parameters  rtj ,  a,  b,  c  and 
r ij ,  a  ,  b ,  c  is  given  as  follows: 

Proposition  5  (Rule  of  Composition).  The  composition  of  a 
motion  specified  by  motion  parameters  o,  6,  c,  fi=(r„)  with 
respect  to  reference  point  (X0,  Y0,Z0)  followed  by  a  motion 
specified  by  motion  parameters  a,  b,  c,  R—^j),  with  respect 
to  reference  point  (A”0,  T0,^0)  is  equivalent  to  a  motion 
specified  by  motion  parameters  a',  b',  c',  R'=(r,j'),  with 
respect  to  reference  point  (X0,  Y0,Z0)  defined  as  follows  (Fig. 

3): 

«-«+«,  b'—b+b,  c'=c+c,  Rf—RR.  (3.8) 


teristics  and  digital  image  processing.  Hence,  it  is  desirable  to 
use  as  features  more  stable  characteristics  such  as  marked 
feature  points,  line  segments,  edges  and  contours. 


4.2  Dot  Features 

If  we  can  identify  specific  dots  in  the  region  S  of  the 
image  plane  under  consideration,  we  can  define  a  feature  as 
the  summation 


F[X\=  E  *»(*<,»<),  (4  1) 

P,tS  v  ’ 

of  the  values  of  an  arbitrary  weight  function  m(z,y)  over  the 
position  P(xf ,yf)  of  the  dots.  This  is  possible,  for  example,  if 
the  object  surface  is  textured  with  dots  which  are  clearly 
visible  and  identifiable  in  each  frame.  However,  a  one-to-one 
correspondence  between  the  dots  is  not  necessary.  This  type 
of  feature  was  used  by  Kanatani  and  Chou  [20]  in  the  “shape 
from  texture”  problem.  Another  possibility  is  to  use  charac¬ 
teristic  points  of  contours.  If  the  planar  region  in  question  is  a 
polygonal  face  of  an  object,  the  corner  points  can  be  used  as 
stable  feature  points. 

If  each  feature  point  (z,-,y,)  is  displaced  by  (Ax,  ,Ay,),  the 
value  of  F[X)  changes  by 

±nX)=pj%(*i,y<)Axi+2jl(xi,yi)Ayi+  ■  ■  J,  (4.2) 

where  . . .  denotes  higher  order  terms  in  Ax,  Ay.  Substituting 
the  equations  of  optical  flow  (2.10),  we  find  the  functionals 

C.JX),  cum,  cB{X],  cc\x\,  CD[X],  CE[X\,  CUM 

in  the  form 


<W-E  %(*.*). 

P,lS  ox 


ptts  °v 


CA  M=  E  W\-  £  s fc^W), 


dm. 


P.cS 


dx 


dm 


P,tS 


dx 


P.eS 


dy 


aj  a  ' 

p,ts  °y 


Using  r„ a1,  b',  c '  as  improved  estimates  of  the  motion 
parameters,  we  repeat  the  whole  procedure  until  all  features 
F*‘*M  become  sufficiently  close  to  F*‘*M]  °f  the  next  frame. 


4.  FEATURES  OF  PLANAR  REGIONS 

4.1  Consistency  of  Features 

A  good  choice  of  features  is  essential  to  the  present 
scheme.  It  is  crucial  that  the  features  be  consistent  in  the 
sense  that  they  must  faithfully  respond  to  image  distortions 
due  to  3D  motion  alone  and  should  not  be  affected  by  other 
factors  such  as  reflectance  changes.  Originally,  characteriza¬ 
tion  of  an  image  by  “features”  (or  “properties”)  was  proposed 
for  gray-level  images  (Amari  [4,  5),  Rosenfeld  and  Kak  [27],  cf. 
Appendix  B).  For  gray-level  images,  however,  we  must  impose 
the  consistency  requirement  that  the  gray  levels  are  inherent 
to  the  object  itself  and  that  they  are  not  affected  by  viewing 
angle  and  reflectance  changes.  This  assumption  is  sometimes 
satisfied,  e.g.,  when  chromaticity  is  measured.  However,  in 
most cases, the absolute values of gray levels do not necessarily
reflect only the physical properties of the object, but are
affected  by  numerous  factors  such  as  lighting,  camera  charac- 


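$$C_E[X] = \sum_{P_i \in S} \left[ x_i^2 \frac{\partial m}{\partial x}(x_i, y_i) + x_i y_i \frac{\partial m}{\partial y}(x_i, y_i) \right],$$

$$C_F[X] = \sum_{P_i \in S} \left[ x_i y_i \frac{\partial m}{\partial x}(x_i, y_i) + y_i^2 \frac{\partial m}{\partial y}(x_i, y_i) \right]. \tag{4.3}$$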


Thus, if we take six independent functions $m_i(x,y)$,
$i = 1, \ldots, 6$, we obtain the corresponding features $F^{(i)}[X]$ in
the form of eqn (4.1), and $C_a^{(i)}[X]$, $C_b^{(i)}[X]$, $C_c^{(i)}[X]$, $C_1^{(i)}[X]$,
$C_2^{(i)}[X]$, $C_3^{(i)}[X]$, $i = 1, \ldots, 6$, are given by eqns (3.4).

Hence,  the  motion  can  be  recovered  by  the  procedure  described 
in  the  previous  section. 
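
As an illustration only, the dot-feature quantities of this subsection can be sketched in Python. The function names are hypothetical, the derivatives of the weight function are supplied by the caller, and the flow is assumed to have the form u = u0 + Ax + By + (Ex + Fy)x, v = v0 + Cx + Dy + (Ex + Fy)y, which is the form implied by the coefficients listed above (eqn (2.10) itself lies outside this section).

```python
def dot_feature(points, m):
    """F[X]: sum of the weight m(x_i, y_i) over the feature points, eqn (4.1)."""
    return sum(m(x, y) for x, y in points)


def dot_coefficients(points, mx, my):
    """Coefficient functionals for dot features, one per flow parameter.

    mx and my are the partial derivatives dm/dx and dm/dy of the weight.
    The flow parameterisation (u0, v0, A, ..., F) is an assumption.
    """
    c = dict.fromkeys(("u0", "v0", "A", "B", "C", "D", "E", "F"), 0.0)
    for x, y in points:
        gx, gy = mx(x, y), my(x, y)
        c["u0"] += gx
        c["v0"] += gy
        c["A"] += x * gx
        c["B"] += y * gx
        c["C"] += x * gy
        c["D"] += y * gy
        c["E"] += x * x * gx + x * y * gy
        c["F"] += x * y * gx + y * y * gy
    return c


def apply_flow(points, p):
    """Displace every point by the assumed eight-parameter flow."""
    moved = []
    for x, y in points:
        s = p["E"] * x + p["F"] * y
        moved.append((x + p["u0"] + p["A"] * x + p["B"] * y + s * x,
                      y + p["v0"] + p["C"] * x + p["D"] * y + s * y))
    return moved
```

For a small flow, the sum of each coefficient times its flow parameter should agree with the actual change of F[X] up to second-order terms, which gives a simple numerical check of the first-order expansion (4.2).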


4.3  Line  Features 

If we can identify specific line segments in the planar
region $S$ under consideration, we can define a feature functional
as the sum of the line integrals

$$F[X] = \sum_{L_i \subset S} \int_{L_i} m(x, y)\, ds \tag{4.4}$$

of an arbitrary weight function m(x,y) along each individual
line segment $L_i$ in the planar region $S$. Again, this type of
functional can be applied if the surface is textured by clearly
visible and identifiable line segments (cf. Kanatani and Chou
[20]), and no knowledge of the correspondence between line
segments is necessary. Alternatively, we can use a single
bounded contour in the region as a stable feature (cf. Waxman
and Wohn [32], Kanatani [15]).

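It follows from elementary calculus that if each point
$(x, y)$ on each line segment $L_i$ is displaced by $(\Delta x, \Delta y)$, the
value of $F[X]$ changes by

$$\Delta F[X] = \sum_{L_i \subset S} \int_{L_i} \left[ \frac{\partial m}{\partial x}\Delta x + \frac{\partial m}{\partial y}\Delta y + \left( n_1^2 \frac{\partial \Delta x}{\partial x} + n_1 n_2 \left( \frac{\partial \Delta x}{\partial y} + \frac{\partial \Delta y}{\partial x} \right) + n_2^2 \frac{\partial \Delta y}{\partial y} \right) m \right] ds + \cdots, \tag{4.5}$$

where $s$ and $(n_1, n_2)$ are the arc length and unit tangent vector
respectively along the line segment $L_i$. Here ... denotes higher
order terms in $\Delta x$, $\Delta y$ and their derivatives. Substituting the
equations of optical flow (2.10), we find the functionals $C_{u_0}[X]$,
$C_{v_0}[X]$, $C_A[X]$, $C_B[X]$, $C_C[X]$, $C_D[X]$, $C_E[X]$, $C_F[X]$ in the
form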


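$$C_{u_0}[X] = \sum_{L_i \subset S} \int_{L_i} \frac{\partial m}{\partial x}\, ds, \qquad C_{v_0}[X] = \sum_{L_i \subset S} \int_{L_i} \frac{\partial m}{\partial y}\, ds,$$

$$C_A[X] = \sum_{L_i \subset S} \int_{L_i} \left[ x \frac{\partial m}{\partial x} + n_1^2 m \right] ds, \qquad C_B[X] = \sum_{L_i \subset S} \int_{L_i} \left[ y \frac{\partial m}{\partial x} + n_1 n_2 m \right] ds,$$

$$C_C[X] = \sum_{L_i \subset S} \int_{L_i} \left[ x \frac{\partial m}{\partial y} + n_1 n_2 m \right] ds, \qquad C_D[X] = \sum_{L_i \subset S} \int_{L_i} \left[ y \frac{\partial m}{\partial y} + n_2^2 m \right] ds, \tag{4.6}$$

$$C_E[X] = \sum_{L_i \subset S} \int_{L_i} \left[ x^2 \frac{\partial m}{\partial x} + xy \frac{\partial m}{\partial y} + (2x n_1^2 + y n_1 n_2 + x n_2^2)\, m \right] ds,$$

$$C_F[X] = \sum_{L_i \subset S} \int_{L_i} \left[ xy \frac{\partial m}{\partial x} + y^2 \frac{\partial m}{\partial y} + (y n_1^2 + x n_1 n_2 + 2y n_2^2)\, m \right] ds.$$
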

Thus, if we take six independent functions $m_i(x,y)$,
$i = 1, \ldots, 6$, we obtain the corresponding features $F^{(i)}[X]$ in
the form of eqn (4.4), and $C_a^{(i)}[X]$, $C_b^{(i)}[X]$, $C_c^{(i)}[X]$, $C_1^{(i)}[X]$,
$C_2^{(i)}[X]$, $C_3^{(i)}[X]$, $i = 1, \ldots, 6$, are given by eqns (3.4).
Hence,  the  motion  is  recovered  by  the  procedure  described  in 
the  previous  section. 
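
A minimal sketch of the line feature (4.4) for straight segments is given below, assuming Python and hypothetical function names. Simpson's rule is used on each segment; this is a choice made for the sketch, not something prescribed in the text. It is exact whenever the weight is a polynomial of degree three or less along the segment.

```python
from math import hypot


def polygon_segments(corners):
    """Closed polygonal contour as a list of ((x0, y0), (x1, y1)) segments."""
    n = len(corners)
    return [(corners[i], corners[(i + 1) % n]) for i in range(n)]


def line_feature(segments, m):
    """Sum of the line integrals of m along straight segments, eqn (4.4).

    Each integral is evaluated by Simpson's rule with the two end points
    and the midpoint of the segment.
    """
    total = 0.0
    for (x0, y0), (x1, y1) in segments:
        length = hypot(x1 - x0, y1 - y0)
        xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
        total += length / 6.0 * (m(x0, y0) + 4.0 * m(xm, ym) + m(x1, y1))
    return total
```

For example, with the unit square as contour and m = 1 the feature is the perimeter, 4; with m = x it is 2.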


4.4 Area Features

If  a  bounding  contour  of  a  planar  face  is  observed,  we  can 
also  define  a  feature  functional  as  the  area  integral 


$$F[X] = \iint_S m(x, y)\, dx\, dy \tag{4.7}$$

of an arbitrary weight function m(x,y) over the region $S$ (cf.
Kanatani [15]). The area integral can always be converted
into a line integral along the contour $C$ of region $S$ by
Green’s theorem. Namely, we can always find two functions
$P(x,y)$, $Q(x,y)$ such that

$$m(x, y) = \frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y}, \tag{4.8}$$

and we obtain

$$\iint_S m(x, y)\, dx\, dy = \int_C \left[ P(x, y)\, n_1 + Q(x, y)\, n_2 \right] ds, \tag{4.9}$$

where $s$ and $(n_1, n_2)$ are the arc length and unit tangent vector
respectively along the contour $C$ counterclockwise.

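If each point $(x, y)$ on the image plane undergoes a displacement
by $(\Delta x, \Delta y)$, the difference $\Delta F[X]$ of the feature
value is expressed either as an area integral over the region $S$

$$\Delta F[X] = \iint_S \left[ \frac{\partial (m \Delta x)}{\partial x} + \frac{\partial (m \Delta y)}{\partial y} \right] dx\, dy + \cdots$$

or as a line integral along its contour $C$ in the form

$$\Delta F[X] = \int_C (n_2 \Delta x - n_1 \Delta y)\, m\, ds + \cdots, \tag{4.10}$$

where ... again denotes higher order terms in $\Delta x$, $\Delta y$ and
their derivatives. Substituting the equation of optical flow
(2.10), we find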

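$$C_{u_0}[X] = \iint_S \frac{\partial m}{\partial x}\, dx\, dy = \int_C n_2 m\, ds, \qquad C_{v_0}[X] = \iint_S \frac{\partial m}{\partial y}\, dx\, dy = -\int_C n_1 m\, ds,$$

$$C_A[X] = \iint_S \left[ m + x \frac{\partial m}{\partial x} \right] dx\, dy = \int_C x n_2 m\, ds,$$

$$C_B[X] = \iint_S y \frac{\partial m}{\partial x}\, dx\, dy = \int_C y n_2 m\, ds,$$

$$C_C[X] = \iint_S x \frac{\partial m}{\partial y}\, dx\, dy = -\int_C x n_1 m\, ds, \tag{4.11}$$

$$C_D[X] = \iint_S \left[ m + y \frac{\partial m}{\partial y} \right] dx\, dy = -\int_C y n_1 m\, ds,$$

$$C_E[X] = \iint_S \left[ 3xm + x^2 \frac{\partial m}{\partial x} + xy \frac{\partial m}{\partial y} \right] dx\, dy = \int_C (x^2 n_2 - xy\, n_1)\, m\, ds,$$

$$C_F[X] = \iint_S \left[ 3ym + xy \frac{\partial m}{\partial x} + y^2 \frac{\partial m}{\partial y} \right] dx\, dy = \int_C (xy\, n_2 - y^2 n_1)\, m\, ds.$$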

Thus, if we take six independent functions $m_i(x,y)$,
$i = 1, \ldots, 6$, we obtain the corresponding features $F^{(i)}[X]$ in
the form of eqn (4.7), and $C_a^{(i)}[X]$, $C_b^{(i)}[X]$, $C_c^{(i)}[X]$, $C_1^{(i)}[X]$,
$C_2^{(i)}[X]$, $C_3^{(i)}[X]$, $i = 1, \ldots, 6$, are obtained by eqns (3.4).

Hence,  the  motion  is  recovered  by  the  procedure  described  in 
the  previous  section. 
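
The conversion of an area integral into a contour integral can be sketched for a polygon as follows. This is a sketch under stated assumptions: we take P = 0 and let Q be an antiderivative of m with respect to x, which is only one of many admissible choices in eqn (4.8); the function name is hypothetical, and Simpson's rule is used on each edge.

```python
def area_feature(corners, Q):
    """Area integral of m = dQ/dx over a polygon via eqn (4.9) with P = 0.

    corners must be listed counterclockwise. Along a straight edge
    n2 ds = dy, so the contour integral of Q n2 ds is a sum of edge
    integrals of Q dy, each evaluated here by Simpson's rule (exact when
    Q is a polynomial of degree three or less along the edge).
    """
    total = 0.0
    n = len(corners)
    for i in range(n):
        (x0, y0), (x1, y1) = corners[i], corners[(i + 1) % n]
        xm, ym = 0.5 * (x0 + x1), 0.5 * (y0 + y1)
        total += (y1 - y0) / 6.0 * (Q(x0, y0) + 4.0 * Q(xm, ym) + Q(x1, y1))
    return total
```

With Q = x (so m = 1) the result is the area of the polygon. The contour forms on the right-hand sides of the coefficient functionals have the same structure: for instance, the contour integral of x n2 m ds for the functional C_A is obtained by passing Q = x m.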


5.  NUMERICAL  EXAMPLES 

5.1  Algorithm  with  Initial  Adjustment 

In  the  following,  we  consider  a  polygonal  region  S  which 
is  supposed  to  be  the  image  of  a  face  of  an  object  in  the  scene. 
As  features,  we  consider  (i)  summations  with  respect  to  the 
corner  points  of  the  contour  C  in  the  form  of  eqn  (4.1),  (ii) 
line  integrals  along  the  contour  C  in  the  form  of  eqn  (4.4),  and 
(iii)  area  integrals  over  the  region  S  in  the  form  of  eqn  (4.7) 
converted  into  line  integrals  by  Green’s  theorem,  eqn  (4.9). 

Let the region $S$ in the first frame be a polygon with
corner points $(x_i, y_i)$, $i = 1, \ldots, N$, and the region $S'$ in the
second frame be a polygon with corner points $(x_i', y_i')$,
$i = 1, \ldots, N$. We compute the arithmetic average $(x_0, y_0)$ of
the corner points in the first frame and define the reference
point $(X_0, Y_0, Z_0)$ as its backprojection by eqns (2.2). Similarly,
we compute $(x_0', y_0')$ for the next frame. We define the
“sizes” of the two regions $S$, $S'$ on the image plane by

$$l_0 = \max_i \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}, \qquad l_0' = \max_i \sqrt{(x_i' - x_0')^2 + (y_i' - y_0')^2}. \tag{5.1}$$

We then define the “estimate of the object size” by

$$L_0 = (Z_0/f + 1)\, l_0. \tag{5.2}$$

From this size $L_0$ and the size $l_0'$ of the region $S'$ in the next
frame, we can estimate the position of the reference point
$(X_0', Y_0', Z_0')$ in the next frame as follows:

$$Z_0' = (L_0/l_0' - 1) f, \qquad X_0' = (Z_0'/f + 1)\, x_0', \qquad Y_0' = (Z_0'/f + 1)\, y_0' \tag{5.3}$$

(Fig. 4). We use the following motion as the initial adjustment:

$$a = X_0' - X_0, \quad b = Y_0' - Y_0, \quad c = Z_0' - Z_0, \quad \Omega_1 = \Omega_2 = \Omega_3 = 0. \tag{5.4}$$
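
The initial adjustment of eqns (5.1) to (5.4) can be sketched as follows. The function names are hypothetical. The depth Z0 of the reference point in the first frame is passed in directly; in the text it comes from backprojection onto the known surface by eqns (2.2), which lie outside this section. The projection x = X/(Z/f + 1) is assumed because it is the form implied by eqns (5.3).

```python
from math import hypot


def centroid(corners):
    """Arithmetic average (x0, y0) of the corner points."""
    n = len(corners)
    return (sum(x for x, _ in corners) / n, sum(y for _, y in corners) / n)


def image_size(corners, center):
    """Size l0 of eqn (5.1): largest distance of a corner from the centre."""
    return max(hypot(x - center[0], y - center[1]) for x, y in corners)


def initial_adjustment(corners, corners_next, Z0, f):
    """Translation (a, b, c) of eqn (5.4); the rotation is taken as zero."""
    c0, c1 = centroid(corners), centroid(corners_next)
    l0, l1 = image_size(corners, c0), image_size(corners_next, c1)
    L0 = (Z0 / f + 1.0) * l0                           # eqn (5.2)
    Z1 = (L0 / l1 - 1.0) * f                           # eqn (5.3)
    X0, Y0 = (Z0 / f + 1.0) * c0[0], (Z0 / f + 1.0) * c0[1]
    X1, Y1 = (Z1 / f + 1.0) * c1[0], (Z1 / f + 1.0) * c1[1]
    return (X1 - X0, Y1 - Y0, Z1 - Z0)
```

For a region parallel to the image plane that undergoes a pure translation, this first guess already reproduces the true translation; in general it only brings the two regions into rough agreement before the iterations start.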

From  the  second  iteration,  we  use  the  scheme  described 
in  Section  3.  As  the  weight  functions,  we  use 

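$$m_1(x,y) = 1/l, \quad m_2(x,y) = (x - x_0)/l, \quad m_3(x,y) = (y - y_0)/l,$$

$$m_4(x,y) = (x - x_0)^2/l^2, \quad m_5(x,y) = (x - x_0)(y - y_0)/l^2, \quad m_6(x,y) = (y - y_0)^2/l^2, \tag{5.5}$$

$$m_7(x,y) = (x - x_0)^3/l^3, \quad m_8(x,y) = (x - x_0)^2 (y - y_0)/l^3,$$

$$m_9(x,y) = (x - x_0)(y - y_0)^2/l^3, \quad m_{10}(x,y) = (y - y_0)^3/l^3,$$

where $l$ is another measure of the size of the region $S$ defined
by

$$l = \max_i \{ |x_i - x_0|, |y_i - y_0| \}. \tag{5.6}$$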

Since we obtain ten equations for six unknowns in the
form of eqns (3.5) and (3.8), we use the least squares method to
solve them. Since the computation is iterative, the weight
functions $m_i(x,y)$, $i = 1, \ldots, 10$, of eqns (5.5) and the size $l$ of
eqn (5.6) are revised at each iteration step. (We use size $l$ of eqn
(5.6) instead of size $l_0$ of eqns (5.1) simply because size $l$ is
computed faster than size $l_0$.) In the following, we use a scale
of length such that $l = 1$. The length of $l/2$ is indicated in
each figure.
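
The two ingredients of this step, the ten weight functions and the least squares solution of ten equations in six unknowns, can be sketched in plain Python. The names are hypothetical, the first weight is taken as 1/l as read from eqns (5.5), and the normal-equation solver below is one simple way of carrying out the least squares step; the text does not say which solver was used.

```python
def weight_functions(x0, y0, l):
    """Ten monomial weights centred at (x0, y0) and scaled by the size l."""
    powers = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2),
              (3, 0), (2, 1), (1, 2), (0, 3)]

    def make(i, j):
        d = max(i + j, 1)  # the constant weight is taken as 1/l (assumption)
        return lambda x, y: (x - x0) ** i * (y - y0) ** j / l ** d

    return [make(i, j) for i, j in powers]


def solve_least_squares(A, b):
    """Minimise |A x - b|^2 through the normal equations (A^T A) x = A^T b.

    A is a list of rows. Gauss-Jordan elimination with partial pivoting is
    used; a rank-deficient A raises ZeroDivisionError.
    """
    n = len(A[0])
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]
```

In use, each row of A would hold the six coefficients obtained for one weight function and the matching entry of b the observed change of that feature between the two frames.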

5.2  Simultaneous  Computation  of  Translation  and 
Rotation 

Fig.  5  shows  an  initial  region  S  and  the  corresponding 
region  S'  after  motion.  The  surface  parameters  (p,q,r)  of  S 
and  S'  are  (-0.408,  0.408,  1.000)  and  (-0.875,  0.399,  2.093) 
respectively.  After  the  initial  adjustment  described  by  eqn 
(5.4),  the  estimate  of  the  surface  parameters  is  (-0.408,  0.408, 
1.972).  (The  p,  q  components  are  unchanged  because  the  sur¬ 
face  is  only  translated.)  The  subsequent  steps  are  as  follows: 

step   corner features           contour features          area features

  1    (-0.900, 0.219, 2.746)    (-0.953, 0.361, 2.299)    (-1.101, 0.358, 2.390)
  2    (-0.474, 0.165, 1.934)    (-0.839, 0.395, 2.143)    (-0.823, 0.368, 2.080)
  3    (-1.383, 0.571, 2.265)    (-0.877, 0.399, 2.093)    (-0.878, 0.405, 2.111)
  4    (-0.929, 0.363, 2.129)    (-0.875, 0.399, 2.093)    (-0.877, 0.399, 2.087)
  5    (-0.864, 0.390, 2.045)    (-0.875, 0.399, 2.093)    (-0.875, 0.399, 2.093)

 10    (-0.875, 0.400, 2.093)    (-0.875, 0.399, 2.093)    (-0.875, 0.399, 2.093)

Convergence  is  very  fast  for  contour  and  area  features;  only 
four  or  five  iterations  are  sufficient.  Here  and  in  the  following, 
we only show the surface parameters p, q, r, but they completely
determine the 3D position of the surface, since there
exists  a  one-to-one  correspondence  between  a  point  on  the 
image  plane  and  a  point  in  the  scene  by  eqns  (2.2). 

The  iteration  process  does  not  always  converge.  From 
experiments,  we  can  conclude  that  the  depth,  i.e.,  the  distance 
of  the  surface  from  the  image  plane,  must  be  small  in  the 
second  frame.  Also,  the  range  of  the  values  that  the  motion 
parameters  can  take  is  limited.  Hence,  the  applicability  of  the 
approach  is  somewhat  restricted  if  the  six  degrees  of  freedom 
of  the  motion  are  to  be  determined  simultaneously.  However, 
the  algorithm  works  very  well  when  the  object  is  near  the 
camera  and  the  motion  is  not  too  large. 


5.3 Alternate Computation of Translations and Rotations

In  the  above,  we  tried  to  determine  the  six  motion 
parameters  at  the  same  time.  It  is  expected,  in  general,  that 
the  computation  becomes  more  stable  and  robust  as  the 
number  of  unknowns  becomes  smaller.  Here,  noting  that 
translations  and  rotations  are  very  different  types  of  motion, 
we  first  apply  the  “translational  scheme”,  which  is  the  same  as 
the  above  except  that  the  rotation  is  set  to  zero  from  the 
beginning.  After  computing  the  estimate  of  the  translation, 
the image is moved accordingly. Next, we apply the “rotational
scheme”, which is again the same as the above except
that the translation is set to zero from the beginning. After
computing the estimate of the rotation, the image is moved
accordingly. This process is repeated, applying the translational
scheme and the rotational scheme alternately.

Fig. 6 shows an initial region $S$ and the corresponding
region $S'$ after motion. The surface parameters (p,q,r) of $S$
and $S'$ are (-0.408, 0.408, 10.00) and (-0.694, 0.382, 22.08)
respectively. After the initial adjustment described by eqn
(5.4), the estimate of the surface parameters is (-0.408, 0.408,
18.89). The following shows intermediate results after each
stage, where one stage consists of the translational scheme
iterated six times and the rotational scheme iterated six times.

stage  corner features           contour features          area features

  1    (-0.691, 0.379, 22.08)    (-0.827, 0.443, 22.50)    (-0.739, 0.438, 21.52)
  2    (-0.691, 0.377, 22.15)    (-0.779, 0.407, 22.57)    (-0.711, 0.436, 21.29)
  3    (-0.692, 0.379, 22.12)    (-0.750, 0.387, 22.59)    (-0.690, 0.434, 21.12)
  4    (-0.693, 0.380, 22.10)    (-0.733, 0.375, 22.60)    (-0.676, 0.432, 21.01)
  5    (-0.694, 0.381, 22.09)    (-0.723, 0.370, 22.58)    (-0.667, 0.431, 20.95)

 10    (-0.694, 0.382, 22.08)    (-0.706, 0.368, 22.43)    (-0.654, 0.425, 20.92)

Convergence is very fast for dot features but is slow for contour
and area features. However, this method has a far wider
range of applicability compared with the simultaneous computation
of translation and rotation. The iteration converges for
a very wide range of positions, orientations and motions.
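
The stage structure of the alternate scheme can be sketched as a small driver. The two step functions are hypothetical placeholders for one iteration of the translational scheme and one iteration of the rotational scheme; their contents depend on the equations of Section 3 and are not reproduced here.

```python
def alternate(state, translational_step, rotational_step, stages=10, inner=6):
    """Run the alternate scheme for a number of stages.

    One stage applies translational_step `inner` times and then
    rotational_step `inner` times. Both steps map a state to an updated
    state. The state after every stage is recorded in the returned history.
    """
    history = []
    for _ in range(stages):
        for _ in range(inner):
            state = translational_step(state)
        for _ in range(inner):
            state = rotational_step(state)
        history.append(state)
    return state, history
```

The default values follow the examples of this section, where one stage consists of six translational and six rotational iterations and results are listed up to the tenth stage.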

5.4  Horizontally  Constrained  Motion 

In many practical situations, motion takes place on a horizontal
plane: cars moving on flat ground, objects displaced on
a flat tabletop, etc. Let $e_1 = (\xi_1, \xi_2, \xi_3)$, $e_2 = (\eta_1, \eta_2, \eta_3)$ be two
mutually orthogonal unit vectors defining a horizontal plane in
reference to our camera-based xyz-coordinate system, and let
$n = (n_1, n_2, n_3)$ be the unit vector normal to that plane (Fig. 7).
In other words, the camera optical axis need not be horizontal.
Then, a translation has the form $\alpha e_1 + \beta e_2$ and a rotation is
specified by rotation angle $\Omega$ around axis $n$ at a prescribed
reference point $(X_0, Y_0, Z_0)$. Substituting

$$a = \alpha \xi_1 + \beta \eta_1, \quad b = \alpha \xi_2 + \beta \eta_2, \quad c = \alpha \xi_3 + \beta \eta_3,$$

$$\Omega_1 = \Omega n_1, \quad \Omega_2 = \Omega n_2, \quad \Omega_3 = \Omega n_3 \tag{5.7}$$

in eqn (3.3), we obtain

$$\Delta F[X] = C_\alpha[X]\, \alpha + C_\beta[X]\, \beta + C_\Omega[X]\, \Omega + \cdots, \tag{5.8}$$

where

$$C_\alpha[X] = \xi_1 C_a[X] + \xi_2 C_b[X] + \xi_3 C_c[X],$$

$$C_\beta[X] = \eta_1 C_a[X] + \eta_2 C_b[X] + \eta_3 C_c[X], \tag{5.9}$$

$$C_\Omega[X] = n_1 C_1[X] + n_2 C_2[X] + n_3 C_3[X].$$
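
The reduction from six to three unknowns can be sketched as follows. The function names are hypothetical, and the six unconstrained coefficients are assumed to be ordered as those of the translation (a, b, c) followed by those of the three rotation components.

```python
def constrained_to_full(alpha, beta, omega, e1, e2, n):
    """Six motion parameters from (alpha, beta, Omega), as in eqns (5.7)."""
    a, b, c = (alpha * e1[k] + beta * e2[k] for k in range(3))
    return (a, b, c, omega * n[0], omega * n[1], omega * n[2])


def constrained_coefficients(C, e1, e2, n):
    """Coefficients of (alpha, beta, Omega) from the six unconstrained ones.

    C holds the three translation coefficients followed by the three
    rotation coefficients, combined as in eqns (5.9).
    """
    c_alpha = sum(e1[k] * C[k] for k in range(3))
    c_beta = sum(e2[k] * C[k] for k in range(3))
    c_omega = sum(n[k] * C[3 + k] for k in range(3))
    return (c_alpha, c_beta, c_omega)
```

By construction, the first-order change of a feature is the same whether it is computed from the six unconstrained parameters and coefficients or from the three constrained ones.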




Fig. 8 shows an initial region $S$ and corresponding region
$S'$ after motion, where $e_1 = (1,0,0)$, $e_2 = (0,0,1)$, $n = (0,1,0)$ (i.e.,
the camera optical axis is horizontal). The surface parameters
(p,q,r) of $S$ and $S'$ are (-0.408, 0.408, 10.0) and (-0.039, 0.378,
21.1) respectively. After the initial adjustment described by
eqn (5.4), the estimate of the surface parameters is (-0.408,
0.408, 23.1). The subsequent steps are as follows:

step   corner features          contour features         area features

  1    (-0.152, 0.382, 22.1)    (-0.338, 0.399, 23.7)    (-0.388, 0.405, 24.2)
  2    (-0.354, 0.401, 23.5)    (-0.087, 0.379, 21.8)    (-0.239, 0.389, 23.1)
  3    ( 0.021, 0.378, 20.7)    (-0.032, 0.378, 21.1)    ( 0.066, 0.379, 20.3)
  4    (-0.107, 0.380, 21.8)    (-0.049, 0.378, 21.2)    (-0.026, 0.378, 21.0)
  5    (-0.049, 0.378, 21.2)    (-0.038, 0.378, 21.1)    (-0.039, 0.378, 21.1)

 10    (-0.039, 0.378, 21.1)    (-0.039, 0.378, 21.1)    (-0.039, 0.378, 21.1)

Convergence  is  very  fast,  and  the  applicable  range  of  positions, 
orientations  and  motions  is  larger  than  for  unconstrained 
motion. 

We can again use the scheme of alternate translations
and rotations; the translational scheme determines $\alpha$ and $\beta$,
and the rotational scheme determines $\Omega$. Fig. 9 shows an initial
region $S$ and the corresponding region $S'$ after motion,
where $e_1 = (1,0,0)$, $e_2 = (0,0,1)$, $n = (0,1,0)$ (i.e., the camera optical
axis is horizontal). The surface parameters (p,q,r) of $S$
and $S'$ are (-0.408, 0.408, 10.0) and (0.137, 0.381, 24.1) respectively.
After the initial adjustment described by eqn (5.4), the
estimate of the surface parameters is (-0.408, 0.408, 31.7).
The following shows intermediate results after each stage,
where again one stage consists of the translational scheme
iterated six times and the rotational scheme iterated six times.

stage  corner features          contour features         area features

  1    ( 0.050, 0.378, 26.2)    (-0.098, 0.380, 28.6)    (-0.184, 0.384, 30.2)
  2    ( 0.122, 0.381, 24.4)    ( 0.011, 0.378, 26.6)    (-0.094, 0.380, 28.8)
  3    ( 0.134, 0.381, 24.1)    ( 0.069, 0.379, 25.4)    (-0.035, 0.378, 27.4)
  4    ( 0.136, 0.381, 24.1)    ( 0.101, 0.380, 24.8)    ( 0.007, 0.378, 26.6)
  5    ( 0.137, 0.381, 24.1)    ( 0.118, 0.381, 24.4)    ( 0.038, 0.378, 28.0)

 10    ( 0.137, 0.381, 24.1)    ( 0.136, 0.381, 24.1)    ( 0.111, 0.380, 24.8)

The  convergence  is  fast  for  dot  features  but  is  slow  for  contour 
and  area  features.  However,  the  applicable  range  of  positions, 
orientations  and  motions  is  still  wider  than  for  determining  the 
three  motion  parameters  simultaneously. 

5.5  Error  Sensitivity 

The correspondenceless approach based on feature extraction
is in general expected to be more robust than the use of
correspondence; since the feature values are computed by summation
or integration, small errors in the positions of feature
points or lines tend to be canceled or diluted in the final values
of the features.

In  the  following  example,  uniformly  distributed  random 
noise  of  10%  of  the  size  l0  and  l0'  of  the  two  regions  S,  S'  (cf. 
eqns  (5.1))  is  added  to  the  coordinate  values  of  the  corner 
points  in  the  example  of  Fig.  9.  The  following  result  is  one 
example.  After  the  initial  adjustment  described  by  eqn  (5.4), 
the  estimate  of  the  surface  parameters  is  (-0.408,  0.408,  29.7). 
The  scheme  and  interpretation  are  the  same  as  in  the  previous 
example. 

stage  corner features          contour features         area features

  1    ( 0.005, 0.380, 28.0)    (-0.100, 0.380, 28.2)    (-0.144, 0.382, 29.7)
  2    ( 0.167, 0.383, 23.9)    (-0.003, 0.378, 26.2)    (-0.044, 0.378, 27.7)
  3    ( 0.178, 0.384, 23.6)    ( 0.042, 0.378, 25.3)    ( 0.021, 0.379, 26.4)
  4    ( 0.179, 0.384, 23.5)    ( 0.065, 0.379, 24.8)    ( 0.065, 0.379, 25.5)
  5    ( 0.179, 0.384, 23.5)    ( 0.075, 0.379, 24.6)    ( 0.097, 0.380, 24.9)

 10    ( 0.180, 0.384, 23.5)    ( 0.084, 0.379, 24.4)    ( 0.166, 0.383, 23.5)

The results seem quite reasonable. Errors are about 30% in p
and 1% in q and r. We must say that the motion is quite
accurately computed, since the computed motion parameters
$(a, c, \Omega)$ after the tenth iteration are (14.9, 15.0, -32.4) for dot
features, (14.8, 14.7, -27.0) for contour features and (14.9,
14.8, -31.6) for area features. The rotation angle $\Omega$ is measured
in degrees. Since the motion is horizontally constrained,
vertical translation b is identically zero and the rotation axis is
always (0,1,0). The true parameters are (15.0, 15.0, -30.0), so
that errors are about 10% in $\Omega$ and about 2% in a and c.
Thus, the error in the rotation angle is magnified to some
extent when expressed in terms of p.
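
The quoted error levels can be checked directly from the figures above. The short Python sketch below uses a hypothetical helper name and only values that appear in this section.

```python
def relative_errors(estimate, truth):
    """Componentwise relative error |estimate - truth| / |truth|."""
    return tuple(abs(e - t) / abs(t) for e, t in zip(estimate, truth))


# True motion parameters (a, c, Omega) and the values computed after the
# tenth iteration for the three types of features, as quoted in the text.
truth = (15.0, 15.0, -30.0)
estimates = {
    "dot": (14.9, 15.0, -32.4),
    "contour": (14.8, 14.7, -27.0),
    "area": (14.9, 14.8, -31.6),
}
errors = {name: relative_errors(values, truth)
          for name, values in estimates.items()}

# Surface parameter p for dot features: true value 0.137, computed 0.180.
p_error = relative_errors((0.180,), (0.137,))[0]
```

The largest error in the rotation angle is 10% (contour features), the errors in a and c stay within 2%, and the error in p for dot features is about 31%, in line with the figures stated above.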

If uniformly distributed 10% noise is further added to the
initial values of the surface parameters p, q, r, the result in a
typical case is as follows. The surface parameters (p,q,r) are
perturbed from the true values (-0.408, 0.408, 10.0) into
(-0.395, 0.387, 9.9). After the initial adjustment described by
eqn (5.4), the estimate of the surface parameters is (-0.395,
0.387, 29.2).

stage  corner features          contour features         area features

  1    ( 0.088, 0.361, 25.8)    (-0.106, 0.362, 27.9)    (-0.145, 0.363, 29.3)
  2    ( 0.157, 0.364, 23.8)    (-0.014, 0.360, 26.1)    (-0.047, 0.360, 27.5)
  3    ( 0.168, 0.365, 23.5)    ( 0.029, 0.360, 25.2)    ( 0.016, 0.360, 26.2)
  4    ( 0.169, 0.365, 23.4)    ( 0.049, 0.360, 24.8)    ( 0.059, 0.360, 25.3)
  5    ( 0.170, 0.365, 23.4)    ( 0.058, 0.360, 24.6)    ( 0.090, 0.361, 24.7)

 10    ( 0.170, 0.365, 23.4)    ( 0.067, 0.360, 24.5)    ( 0.155, 0.364, 23.4)

Errors are about 30% in p and about 5% in q and r. The
motion parameters $(a, c, \Omega)$ after the tenth iteration are (14.8,
14.9, -31.1) for dot features, (14.7, 14.6, -25.4) for contour
features and (14.8, 14.7, -30.4) for area features, so that the
motion parameters themselves are quite accurately computed
in spite of the existence of noise.

Fig. 10(a) is an example of a sequence of object images
undergoing a horizontally constrained motion. Again, the
plane constraining the motion is parallel to the zx-plane. For
each image, 10% uniformly distributed random noise of its size
(cf. eqn (5.1)) is added to the coordinates of the corner points.
The true images are indicated by dotted lines. Fig. 10(b) shows
computed  surface  positions  viewed  from  above  (i.e.,  along  the 
y-axis).  A  solid  line  is  used  for  the  true  positions,  a  broken 
line  for  positions  computed  from  the  noisy  images  in  Fig. 
10(a),  and  a  dotted  line  for  positions  computed  by  adding  an 
additional  10%  noise  to  the  initial  estimates  of  the  surface 
parameters.  If  we  use  the  noiseless  images  of  Fig.  10(a),  the 
true  positions  are  obtained. 
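
The noise model used in these experiments can be sketched as follows. The text does not spell out the exact distribution, so this sketch assumes that each coordinate of each corner point is perturbed independently by a value drawn uniformly from the interval of plus or minus 10% of the region size of eqn (5.1). The function name is hypothetical.

```python
import random
from math import hypot


def add_uniform_noise(corners, fraction=0.1, rng=None):
    """Perturb corner coordinates by uniform noise scaled to the region size.

    The region size is the largest distance of a corner from the average
    of the corners, as in eqn (5.1). Each coordinate receives an
    independent perturbation in [-fraction * size, +fraction * size]
    (an assumption of this sketch).
    """
    rng = rng or random.Random()
    n = len(corners)
    cx = sum(x for x, _ in corners) / n
    cy = sum(y for _, y in corners) / n
    bound = fraction * max(hypot(x - cx, y - cy) for x, y in corners)
    return [(x + rng.uniform(-bound, bound), y + rng.uniform(-bound, bound))
            for x, y in corners]
```

Passing a seeded generator, for example random.Random(0), makes a noisy sequence reproducible.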

The  computation  is  based  on  dot  features,  using  the 
scheme  of  alternate  translations  and  rotations.  The  iterations 
are  stopped  after  ten  stages.  The  initial  image  (the  left-most 
one  in  Fig.  10(a))  is  always  compared  with  the  image  at  each 
stage,  so  that  error  does  not  accumulate.  In  view  of  the  fact 
that  we  are  comparing  a  10%  noisy  image  with  another  10% 
noisy image (see Fig. 10(a)), the recovery is fairly good and is
almost  unaffected  by  the  noise  in  the  initial  estimates  of  the 
surface  parameters. 

6. DISCUSSION




6.1  Applicability  of  the  Scheme 

As seen in the examples in the previous section, our
scheme can be applied to a very large motion. In general,
convergence is fast for small motions. The iterations do not
converge if the motion is extremely large (especially for rotations
through large angles). However, the alternate application of
the translational and rotational schemes can greatly extend the
applicable range for the position and motion, covering almost
all motions we may encounter in practical situations, although
convergence may become somewhat slow. The assumption of
horizontally constrained motion also extends the applicable
range and makes the convergence faster. One general conclusion
is that all the motion parameters can be determined
simultaneously if the object is known to be near the camera
(i.e., the distance to the object is not large compared with the
focal length). However, the scheme of alternate translations
and rotations should be used if the object is known to be far
away from the camera.

6.2  Shape  and  Choice  of  Features 

Although  we  used  polygonal  regions  in  our  examples, 
application  of  the  present  scheme  is  not  limited  to  polygonal 
regions: it can be applied to arbitrary shapes, and the contours
can be general curves. From our experiments, however,
convergence seems to be better for simple convex shapes than for
complicated non-convex shapes.

Among  the  three  types  of  features  (dot  features,  line 
features,  area  features),  dot  features  seem  to  be  the  most 
favorable  in  convergence,  although  the  difference  is  not  very 
large.  However,  since  line  features  (especially  contours)  are 
more  stable  than  feature  points  (such  as  corner  points)  and 
line  integrals  are  computationally  easy  to  handle,  the  use  of 
line  features  seems  to  be  the  best  fit  to  practical  applications. 

6.3  Stability  of  Computation 

As shown in the example in the previous section, the
computation is very robust to noise, especially as regards the
motion parameters. The example given by Tsai and Huang
[29] shows that their result is unreliable even at a 1 to 2%
noise level. The stability of our scheme may be ascribed to the
following two reasons. First, we assume the initial position
and compute only the motion parameters, while schemes like
that of Tsai and Huang [29] attempt to determine the object
position and motion simultaneously. In other words, the
dimensionality of our solution space is smaller; the number of
unknowns is six if the motion is unconstrained and is three if
the motion is horizontally constrained. Second, we compute
“features” of a planar region, which are macroscopic quantities
obtained by summation or integration. Hence, independent
noise at different points on the image plane tends to be
canceled.

In  general,  there  are  many  factors  other  than  random 
noise  that  affect  motion  detection  -  illumination  change, 
reflectance  change,  occlusion,  etc.  In  view  of  these  factors,  the 
use of gray levels is not adequate, as discussed earlier. Even if
images are represented by feature points and lines, corresponding
feature points may be missing in some frames due to noise,
illumination change, reflectance change, occlusion, etc. In
contrast, region-to-region correspondence is much more stable.
Our  method  is  applicable  unless  the  object  does  not  have  a 
planar  face  at  all  or  all  of  the  planar  faces  are  occluded  at  the 
same  time,  which  is  very  unlikely. 


6.4  Face  Identification 

In our method, no knowledge of point-to-point correspondence
between different frames is necessary. Instead, we must
determine region-to-region correspondence. This is generally
easier than establishing point-to-point correspondence. Suppose
the object has planar faces, but it need not be a
polyhedron. Take a car for example. Our method is applicable
if we can identify planar windowpanes.

Assuming  that  the  two  given  images  are  appropriately 
segmented,  we  select  one  planar  region  from  one  image.  All  we 
need  to  do  is  determine  the  corresponding  region  in  the  other 
image. We can introduce various “measures of similarity” to
decide on the correspondence. Even if they are unavailable,
we can resort to a brute force method. If there are N candidates
for the corresponding region (usually N is not very
large), there exist N possible correspondences. We can apply
our  method  to  all  of  these  possibilities  to  test  for  the  existence 
of  a  motion  which  correctly  maps  one  region  onto  the  other. 

An  important  fact  is  that  only  one  region-to-region 
correspondence  is  sufficient  to  determine  the  motion  uniquely. 
If  we  use  point-to-point  correspondence,  at  least  three 
correspondence  pairs  are  necessary.  The  brute  force  method 
becomes  inefficient  in  this  case;  if  there  are  n  feature  points 
(usually  n  is  much  larger  than  N)  and  if  we  choose  three 
points  from  one  image  arbitrarily,  the  number  of  possible 
correspondences is n(n-1)(n-2). Granted that efficient
methods for point-to-point correspondence detection are available,
detection of region-to-region correspondence is much
easier due to the more limited possibilities.
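
The difference in search size can be made concrete with a short Python sketch. The function names are hypothetical, and the residual function stands in for running the motion computation on a candidate region and measuring how well the resulting motion maps one region onto the other.

```python
def point_triple_correspondences(n):
    """Number of ways to assign three chosen points to n candidate points."""
    return n * (n - 1) * (n - 2)


def best_region(candidates, fit_residual):
    """Brute force search over the N candidate regions.

    fit_residual is supplied by the caller and returns, for one candidate,
    the remaining feature mismatch after the motion has been computed.
    The candidate with the smallest mismatch is returned.
    """
    return min(candidates, key=fit_residual)
```

With n = 20 feature points there are already 6840 point triples to test, whereas the region search needs one motion computation per candidate region.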

6.5  Application  to  Object  Recognition 

A  major  restriction  on  the  present  method  is  that  we 
must  measure  the  initial  position  of  the  object,  say,  by  stereo 
or  range  sensing.  However,  this  restriction  also  applies  to  the 
correspondenceless  approaches  of  Lin  et  al.  [22]  and  Aloimonos 
and Basu [3] using two cameras. While their methods require
two cameras at each stage, our method requires position measurement
(stereo or range sensing) only at the initial stage.
This is a great advantage, since accurate measurement of position
by stereo or range sensing usually takes time. Besides,
their  methods  are  based  on  corresponding  feature  points,  while 
our  method  is  not  limited  to  the  use  of  feature  points;  our 
method  works  for  other  types  of  features  such  as  line  segments 
and  contours. 

Once  the  initial  position  is  known,  we  can  compute  the 
subsequent  motion  successively  from  a  given  image  sequence. 
Errors  may  accumulate  at  each  stage,  but  this  is  not  a  serious 
problem  because  our  method  applies  to  finite  motion;  any 
image  frame  can  be  compared,  in  principle,  with  the  initial 
frame. 

This fact can be used to advantage in model-based object
recognition. Suppose we want to know the position and orientation
of an object in the scene when the true size and shape
of the object are known and are stored in a database. Since
the true size and shape are known, we can easily generate the
image of the object placed in a known position by, say, a computer
graphics technique. Then, the position and orientation
of the object are determined by computing the “motion”
between this reference image and the observed image. The
computation becomes efficient if several reference images and
feature values are precomputed and stored in the same
database.




Acknowledgment. The author thanks Prof. Azriel Rosenfeld
and Prof. Larry S. Davis of the University of Maryland for
helpful comments and discussions.

REFERENCES 

1.  G.  Adiv,  Determining  3-D  motion  and  structure  from  opt¬ 
ical  flow  generated  by  several  moving  objects,  IEEE 
Trans.  Pattern  Anal.  Machine  Intel!,  PAlMI  7-4,  1985, 
384  -  401 

2.  G.  Adiv,  Inherent  ambiguities  in  recovering  3-D  motion 
and  structure  from  a  noisy  flow  field,  Proc.  DARPA 
Image  Understanding  Workshop,  Miami  Beach,  FL,  1985, 
pp.  399  -  412. 

3.  J.  Aloimonos  and  A.  Basu,  Shape  and  3-D  motion  from 
contour  without  point  to  point  correspondences:  general 
principles,  Proc.  IEEE  Conf.  Comput.  Vision  Pattern 
Recog.  Miami  Beach,  FL,  1986,  pp.  518  -527. 

4.  S.  Amari,  Invariant  structures  of  signal  and  feature 
spaces  in  pattern  recognition  problems,  RAAG  Memoirs, 
4,  1968,  553  -  566. 

5.  S.  Amari,  Feature  spaces  which  admit  and  dete  t  invari¬ 
ant  signal  transformations,  Proc.  4th  Int.  Joint  Conf.  Pat¬ 
tern  Recog.,  Tokyo,  1978,  pp.  452  -  456. 

6.  P.  Anandan,  Computing  Optical  Flow  from  Two  Frames 

of  an  Image  Sequence,  COINS  Technical  Report  86-16, 
Department  of  Computer  and  Information  Science, 

University  of  Massachusetts  at  Amherst,  April,  1986. 

7.  P.  Anandan  and  R.  Weiss,  Introducing  a  Smoothness 

Constraint  in  a  Matching  Approach  for  the  Computation 
of  Displacement  Fields,  COINS  Technical  Report  85-38, 
Department  of  Computer  and  Information  Science, 

University  of  Massachusetts  at  Amherst,  December,  1985. 

8.  A.  Bandopadhay  and  R.  Dutta,  Measuring  image  motion 
in  dynamic  images,  Proc.  IEEE  Workshop  on  Motion: 
Representation  and  Analysis,  Charleston,  SC,  May,  1986, 
pp.  67  -  72. 

9.  B.  F.  Buxton,  D.  W.  Murray,  H.  Buxton  and  N.  S.  Willi¬ 
ams,  Structure-from-motion  algorithms  for  computer 
vision  on  an  SIMD  architecture,  Comp.  Phys.  Comm.,  37, 
1985,  273  -  280. 

10.  D-  Cyganski  and  J.  A.  Orr,  Applications  of  tensor  theory 
to  object  recognition  and  orientation  determination, 
IEEE  Trans.  Pattern  Anal.  Machine  Intel I,  PAMI  7-6, 
1985,  662  -  673. 

11.  W.  Enkelmann,  Investigation  of  multigrid  algorithms  for 
the  estimation  of  optical  flow  fields  in  image  sequences, 
Proc.  IEEE  Workshop  on  Motion:  Representation  and 
Analysis,  Charleston,  SC,  May,  1986,  pp.  81  -87. 

12.  B.  K.  P.  Horn  and  B.  G.  Schunck,  Determining  optical 
flow,  Arlif.  Intel l.,  17,  1981,  185  -  203. 

13. K. Kanatani, Detection of surface orientation and motion
from texture by a stereological technique, Artif. Intell.,
23, 1984, 213 - 237.

14.  K.  Kanatani,  Tracing  planar  surface  motion  from  projec¬ 
tion  without  knowing  the  correspondence,  Comput. 
Vision  Graphics  Image  Process.,  29,  1984,  1  -  12. 

15. K. Kanatani, Detecting the motion of a planar surface by
line and surface integrals, Comput. Vision Graphics
Image Process., 29, 1984, 13 - 22.

16. K. Kanatani, Structure and motion without correspondence:
general principle, Proc. 9th Int. Joint Conf. Artif.
Intell., Los Angeles, CA, 1985, pp. 886 - 888.

17.  K.  Kanatani,  Structure  and  motion  without  correspon¬ 
dence:  general  principle,  Proc.  DARPA  Image  Under¬ 
standing  Workshop,  Miami  Beach,  FL,  1985,  pp.  107  - 
116. 

18.  K.  Kanatani,  Structure  and  motion  from  optical  flow 
under  orthographic  projection,  Comput.  Vision  Graphics 
Image  Process,  (to  appear). 

19.  K.  Kanatani,  Structure  and  motion  from  optical  flow 
under  perspective  projection,  Comput.  Vision  Graphics 
Image  Process,  (to  appear). 

20.  K.  Kanatani  and  T.-C.  Chou,  Shape  from  texture:  general 
principle,  Proc.  IEEE  Conf.  Comput.  Vision  Pattern 
Recog.  Miami  Beach,  FL,  1986,  pp.  578  -  583. 

21.  R.  Kories  and  G.  Zimmermann,  A  versatile  method  for 
the  estimation  of  displacement  vector  fields  from  image 
sequences,  Proc.  IEEE  Workshop  on  Motion:  Representa¬ 
tion  and  Analysis,  Charleston,  SC,  May,  1986,  pp.  101  - 
106. 

22.  Z.-C.  Lin,  T.  S.  Huang,  S.  D.  Blostein,  H.  Lee  and  E.  A. 
Margerum,  Motion  estimation  from  3-D  point  sets  with 
and  without  correspondences,  Proc.  IEEE  Conf.  Comput. 
Vision  Pattern  Recog.  Miami  Beach,  FL,  1986,  pp.  194  - 
201. 

23. H. C. Longuet-Higgins, A computer algorithm for reconstructing
a scene from two projections, Nature, 293,
1981, 133 - 135.

24. H. C. Longuet-Higgins, The visual ambiguity of a moving
plane, Proc. R. Soc. Lond., B-223, 1984, 165 - 175.

25.  H.-H.  Nagel,  Representation  of  moving  rigid  objects  based 
on  visual  observations,  Computer,  14-8,  1981,  29  -  39. 

26.  J.  M.  Prager  and  M.  A.  Arbib,  Computing  the  optic  flow: 
the  MATCH  algorithm  and  prediction,  Comput.  Vision 
Graphics  Image  Process.,  24,  1983,  271  -  304. 

27.  A.  Rosenfeld  and  A.  C.  Kak,  Digital  Picture  Processing, 
Vol.  2,  Academic  Press,  New  York,  NY,  1982. 

28.  M.  Subbarao  and  A.  M.  Waxman,  Closed  form  solution 
to  image  flow  equations  for  planar  surface  in  motion, 
Comput.  Vision  Graphics  Image  Process,  (to  appear). 

29. R. Y. Tsai and T. S. Huang, Uniqueness and estimation
of three-dimensional motion parameters of rigid objects
with curved surfaces, IEEE Trans. Pattern Anal. Machine
Intell., PAMI 6-1, 1984, 13 - 27.

30. S. Ullman, The Interpretation of Visual Motion, MIT
Press, Cambridge, MA, 1979.

31.  A.  M.  Waxman  and  S.  Ullman,  Surface  structure  and 
three-dimensional  motion  from  image  flow  kinematics, 
Int.  J.  Robotics  Res.,  4-3,  1985,  72  -  94. 

32.  A.  M.  Waxman  and  K.  Wohn,  Contour  evolution,  neigh¬ 
borhood  deformation,  and  global  image  flow:  planar  sur¬ 
faces  in  motion,  Int.  J.  Robotics  Res.,  4-3,  1985,  95  -  108. 

APPENDIX  A.  CONJUGATE  FEATURE 
TRANSFORMATION 




A.1 Conjugate Transformation of Features

The  principle  of  our  method  is  to  iteratively  transform 
one  region  into  another.  If  a  region  is  characterized  by  a  small 
number  of  parameters,  say  the  coordinates  of  the  corners  of  a 
polygonal  region,  the  transformation  is  easy;  only  these  param¬ 
eter  values  need  be  altered.  However,  if  the  region  contains  a 
large  amount  of  information,  e.g.,  when  the  surface  is  finely 
textured,  transformation  of  the  image  requires  large  amounts 
of  computation  time  and  memory  space.  If  this  is  the  case,  we 
need  not  transform  the  image;  all  we  need  is  the  transformed 
feature  values,  not  the  transformed  image  itself. 

Define the image transformation operator $T_A$ by

$$\bar{X}(x,y) = T_A X(x,y), \qquad (A.1)$$

where $A$ is the matrix prescribing the transformation of eqns
(2.5). Since the inverse transformation of eqns (2.5) is given
by replacing matrix $A = (A_{ij})$ by its inverse
$A^{-1} = (A^{-1}_{ij})$, operator $T_A$ is defined as follows:

Definition 1 (Image Transformation Operator).

$$T_A X(x,y) = X\!\left( f\,\frac{A^{-1}_{11}x + A^{-1}_{12}y + A^{-1}_{13}f}{A^{-1}_{31}x + A^{-1}_{32}y + A^{-1}_{33}f},\;\; f\,\frac{A^{-1}_{21}x + A^{-1}_{22}y + A^{-1}_{23}f}{A^{-1}_{31}x + A^{-1}_{32}y + A^{-1}_{33}f} \right). \qquad (A.2)$$

This equation states that the value of the transformed image
$T_A X$ at $(x,y)$ is given by the value of the original image $X$ at
the point obtained by applying the inverse transformation to
the point $(x,y)$.

Now, define the conjugate transformation operator $T_A^*$ by

Definition 2 (Conjugate Transformation Operator).

$$T_A^* F[X] = F[T_A X]. \qquad (A.4)$$

This equation states that the value of feature $F[\,\cdot\,]$ of the
transformed image $\bar{X}$ is obtained as the value of the
transformed feature $T_A^* F[\,\cdot\,]$ of the original image $X$. An
essential fact is that operator $T_A^*$ acts on a given functional
$F[\,\cdot\,]$ by $T_A^* F[\,\cdot\,] = F[T_A(\,\cdot\,)]$ without any regard to
particular images. Hence, we need not transform the image
according to eqn (A.2); measuring feature $F[\bar{X}]$ of the
transformed image $\bar{X}(x,y)$ is equivalent to measuring feature
$T_A^* F[X]$ of the original image $X(x,y)$. Hence, once we know how
operator $T_A^*$ acts on functionals, the computation can always
be performed on the original image.
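
As an illustration (not part of the original derivation), the following
minimal Python sketch checks Definition 2 in the simplest case, a dot
feature of the form F[X] = sum of m(x_i, y_i) over the dots. It assumes the
projective form of the transformation used in Appendix A.2; the matrix A,
the focal length f, the dot positions, and the weight function m are
arbitrary values chosen for the example.

```python
import math

def transform(A, f, x, y):
    """Projective map (x, y) -> (x_bar, y_bar) induced by a 3x3 matrix A."""
    d = A[2][0] * x + A[2][1] * y + A[2][2] * f
    x_bar = f * (A[0][0] * x + A[0][1] * y + A[0][2] * f) / d
    y_bar = f * (A[1][0] * x + A[1][1] * y + A[1][2] * f) / d
    return x_bar, y_bar

def dot_feature(dots, m):
    """F[X]: sum of the weight function m over the dot positions of image X."""
    return sum(m(x, y) for x, y in dots)

def conjugate(A, f, m):
    """Conjugate-transformed weight: evaluate m at the transformed position."""
    return lambda x, y: m(*transform(A, f, x, y))

# Assumed example values (hypothetical, for illustration only).
A = [[1.0, 0.1, 0.05], [-0.1, 1.0, 0.02], [0.01, -0.02, 1.0]]
f = 1.0
dots = [(0.1, 0.2), (-0.3, 0.4), (0.5, -0.1)]
m = lambda x, y: x * x + 2.0 * y

# Route 1: build the transformed image, then measure the feature on it.
moved = [transform(A, f, x, y) for x, y in dots]
direct = dot_feature(moved, m)

# Route 2: leave the image alone and measure the conjugate feature on it.
via_conjugate = dot_feature(dots, conjugate(A, f, m))

print(math.isclose(direct, via_conjugate))
```

Both routes evaluate m at the same transformed positions, so the two
values agree; the second route never constructs the transformed image.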

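Consequently, the motion parameters $a$, $b$, $c$, $\Omega_1$, $\Omega_2$, $\Omega_3$
of the next step are given by solving

$$\Big[\, T_A^* C_j^{(i)}[X] \,\Big]\, \big(a,\ b,\ c,\ \Omega_1,\ \Omega_2,\ \Omega_3\big)^{\mathsf T} = \Big(\, F^{(i)}[X'] - T_A^* F^{(i)}[X] \,\Big), \qquad (A.5)$$

where row $i$ corresponds to the $i$-th feature and column $j$ to the
$j$-th motion parameter.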


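A.2 Change of Variables

From the discussion above, it remains to determine how the
conjugate transformation operator $T_A^*$ acts on particular
feature functionals. For that purpose, we give here some
mathematical preliminaries in preparation for the subsequent
discussion.

Consider two points $(x,y)$ and $(x+dx,\,y+dy)$
infinitesimally far apart from each other on image $X$, and let
$(\bar{x},\bar{y})$, $(\bar{x}+d\bar{x},\,\bar{y}+d\bar{y})$ be the corresponding points on the
transformed image $\bar{X}$. By differentiating the equations of
transformation

$$\bar{x} = f\,\frac{A_{11}x + A_{12}y + A_{13}f}{A_{31}x + A_{32}y + A_{33}f}, \qquad \bar{y} = f\,\frac{A_{21}x + A_{22}y + A_{23}f}{A_{31}x + A_{32}y + A_{33}f}, \qquad (A.6)$$

we see that differentials $d\bar{x}$, $d\bar{y}$ are related to differentials
$dx$, $dy$ by

$$d\bar{x} = \frac{\partial\bar{x}}{\partial x}\,dx + \frac{\partial\bar{x}}{\partial y}\,dy, \qquad d\bar{y} = \frac{\partial\bar{y}}{\partial x}\,dx + \frac{\partial\bar{y}}{\partial y}\,dy, \qquad (A.7)$$

where

$$\begin{bmatrix} \partial\bar{x}/\partial x & \partial\bar{x}/\partial y \\ \partial\bar{y}/\partial x & \partial\bar{y}/\partial y \end{bmatrix} = -\frac{1}{A_{31}x + A_{32}y + A_{33}f} \begin{bmatrix} A_{31}\bar{x} - A_{11}f & A_{32}\bar{x} - A_{12}f \\ A_{31}\bar{y} - A_{21}f & A_{32}\bar{y} - A_{22}f \end{bmatrix}. \qquad (A.8)$$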

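Let us consider a smooth curve $L$ on the $xy$-plane. If
$(x(s),y(s))$ is its parametric form, where $s$ is the arc length
along the curve $L$, the unit tangent vector $(n_1(s),n_2(s))$ to the
curve $L$ is given by

$$n_1(s) = \frac{dx}{ds}, \qquad n_2(s) = \frac{dy}{ds}. \qquad (A.9)$$

Let $\bar{L}$ be the curve on the $\bar{x}\bar{y}$-plane corresponding to the
curve $L$ on the $xy$-plane, and let $(\bar{x}(\bar{s}),\bar{y}(\bar{s}))$ be its parametric
form, where $\bar{s}$ is the arc length along the curve $\bar{L}$. Substituting
eqns (A.7) and (A.8) in $d\bar{s}^2 = d\bar{x}^2 + d\bar{y}^2$, we obtain

$$d\bar{s}^2 = E\,dx^2 + 2F\,dx\,dy + G\,dy^2, \qquad (A.10)$$

where

$$E = \frac{A_{31}^2(\bar{x}^2+\bar{y}^2) - 2A_{31}f(A_{11}\bar{x} + A_{21}\bar{y}) + (A_{11}^2 + A_{21}^2)f^2}{(A_{31}x + A_{32}y + A_{33}f)^2},$$

$$F = \frac{A_{31}A_{32}(\bar{x}^2+\bar{y}^2) - (A_{11}A_{32} + A_{12}A_{31})f\bar{x} - (A_{21}A_{32} + A_{22}A_{31})f\bar{y} + (A_{11}A_{12} + A_{21}A_{22})f^2}{(A_{31}x + A_{32}y + A_{33}f)^2}, \qquad (A.11)$$

$$G = \frac{A_{32}^2(\bar{x}^2+\bar{y}^2) - 2A_{32}f(A_{12}\bar{x} + A_{22}\bar{y}) + (A_{12}^2 + A_{22}^2)f^2}{(A_{31}x + A_{32}y + A_{33}f)^2}.$$

Combining this with eqns (A.9), we have

$$d\bar{s} = \Gamma(s)\,ds, \qquad (A.12)$$

where

$$\Gamma(s) = \sqrt{E(x(s),y(s))\,n_1(s)^2 + 2F(x(s),y(s))\,n_1(s)\,n_2(s) + G(x(s),y(s))\,n_2(s)^2}. \qquad (A.13)$$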

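The unit tangent vector to the curve $\bar{L}$ is given by
$(\bar{n}_1,\bar{n}_2)$. Since both $(n_1,n_2)$ and $(\bar{n}_1,\bar{n}_2)$ are unit vectors, they
are related by

$$\begin{bmatrix} \bar{n}_1 \\ \bar{n}_2 \end{bmatrix} = \frac{1}{\sqrt{E n_1^2 + 2F n_1 n_2 + G n_2^2}} \begin{bmatrix} \partial\bar{x}/\partial x & \partial\bar{x}/\partial y \\ \partial\bar{y}/\partial x & \partial\bar{y}/\partial y \end{bmatrix} \begin{bmatrix} n_1 \\ n_2 \end{bmatrix}. \qquad (A.14)$$

Now, we define the conjugate transformation $T_A^*$ of the
unit tangent vector $(n_1,n_2)$ by

$$T_A^* n_1 = \frac{1}{\sqrt{E n_1^2 + 2F n_1 n_2 + G n_2^2}} \left( \frac{\partial\bar{x}}{\partial x} n_1 + \frac{\partial\bar{x}}{\partial y} n_2 \right), \qquad T_A^* n_2 = \frac{1}{\sqrt{E n_1^2 + 2F n_1 n_2 + G n_2^2}} \left( \frac{\partial\bar{y}}{\partial x} n_1 + \frac{\partial\bar{y}}{\partial y} n_2 \right), \qquad (A.15)$$

and the conjugate transformation $T_A^*$ of a function $m(x,y)$
and its derivatives by

$$T_A^* m(x,y) = m(\bar{x}(x,y),\bar{y}(x,y)), \qquad (A.16)$$

$$T_A^* \frac{\partial m}{\partial x} = \frac{\partial m}{\partial x}(\bar{x}(x,y),\bar{y}(x,y)), \qquad T_A^* \frac{\partial m}{\partial y} = \frac{\partial m}{\partial y}(\bar{x}(x,y),\bar{y}(x,y)). \qquad (A.17)$$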

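Finally, consider the region $\bar{S}$ on the $\bar{x}\bar{y}$-plane
corresponding to a region $S$ on the $xy$-plane. The area of $\bar{S}$ is
given by

$$\iint_{\bar{S}} d\bar{x}\,d\bar{y} = \iint_{S} J(x,y)\,dx\,dy, \qquad (A.18)$$

where $J(x,y)$ is the Jacobian defined by

$$J(x,y) = \left|\,\det \begin{bmatrix} \partial\bar{x}/\partial x & \partial\bar{x}/\partial y \\ \partial\bar{y}/\partial x & \partial\bar{y}/\partial y \end{bmatrix} \right| = \sqrt{E(x,y)\,G(x,y) - F(x,y)^2}$$

$$\phantom{J(x,y)} = \frac{f\,\big|(A_{21}A_{32} - A_{22}A_{31})\bar{x} + (A_{12}A_{31} - A_{11}A_{32})\bar{y} + (A_{11}A_{22} - A_{12}A_{21})f\big|}{(A_{31}x + A_{32}y + A_{33}f)^2}. \qquad (A.19)$$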
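
The relations of this section can be checked numerically. The sketch
below is an illustration rather than part of the original method: it
differentiates the projective transformation in closed form, compares the
result with central finite differences, and confirms that the Jacobian
determinant agrees in magnitude with the square root of EG - F^2. The
matrix A, the focal length f, and the test point are arbitrary assumed
values.

```python
import math

def transform(A, f, x, y):
    """Projective map (x, y) -> (x_bar, y_bar) induced by a 3x3 matrix A."""
    d = A[2][0] * x + A[2][1] * y + A[2][2] * f
    return (f * (A[0][0] * x + A[0][1] * y + A[0][2] * f) / d,
            f * (A[1][0] * x + A[1][1] * y + A[1][2] * f) / d)

def derivative_matrix(A, f, x, y):
    """Closed-form partial derivatives, obtained by differentiating the map."""
    d = A[2][0] * x + A[2][1] * y + A[2][2] * f
    xb, yb = transform(A, f, x, y)
    return [[(A[0][0] * f - A[2][0] * xb) / d, (A[0][1] * f - A[2][1] * xb) / d],
            [(A[1][0] * f - A[2][0] * yb) / d, (A[1][1] * f - A[2][1] * yb) / d]]

def numeric_derivative_matrix(A, f, x, y, h=1e-6):
    """The same derivatives estimated by central finite differences."""
    xp, yp = transform(A, f, x + h, y)
    xm, ym = transform(A, f, x - h, y)
    xu, yu = transform(A, f, x, y + h)
    xd, yd = transform(A, f, x, y - h)
    return [[(xp - xm) / (2 * h), (xu - xd) / (2 * h)],
            [(yp - ym) / (2 * h), (yu - yd) / (2 * h)]]

def metric_coefficients(M):
    """Coefficients E, F, G of the squared arc length of the transformed curve."""
    E = M[0][0] ** 2 + M[1][0] ** 2
    F = M[0][0] * M[0][1] + M[1][0] * M[1][1]
    G = M[0][1] ** 2 + M[1][1] ** 2
    return E, F, G

# Assumed example values (hypothetical, for illustration only).
A = [[1.0, 0.1, 0.05], [-0.1, 1.0, 0.02], [0.01, -0.02, 1.0]]
f = 1.0
M = derivative_matrix(A, f, 0.3, -0.2)
N = numeric_derivative_matrix(A, f, 0.3, -0.2)
E, F, G = metric_coefficients(M)
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]

print(math.isclose(abs(det), math.sqrt(E * G - F * F), rel_tol=1e-9))
```

The agreement of |det| with the square root of EG - F^2 is Lagrange's
identity for two plane vectors, which is why either expression can serve
as the area scale factor.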

A.3  Conjugate  Transformation  Operator 

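The feature functionals for dot features have the form

$$F[X] = \sum_{P_i \in S} f\Big(x_i,\, y_i,\, m(x_i,y_i),\, \frac{\partial m}{\partial x}(x_i,y_i),\, \frac{\partial m}{\partial y}(x_i,y_i)\Big). \qquad (A.20)$$

By Definition 2, the action of the conjugate transformation
operator $T_A^*$ is given by

$$T_A^* F[X] = F[T_A X]$$

$$= \sum_{\bar{P}_i \in \bar{S}} f\Big(\bar{x}_i,\, \bar{y}_i,\, m(\bar{x}_i,\bar{y}_i),\, \frac{\partial m}{\partial x}(\bar{x}_i,\bar{y}_i),\, \frac{\partial m}{\partial y}(\bar{x}_i,\bar{y}_i)\Big)$$

$$= \sum_{P_i \in S} f\Big(\bar{x}(x_i,y_i),\, \bar{y}(x_i,y_i),\, T_A^* m(x_i,y_i),\, T_A^* \frac{\partial m}{\partial x}(x_i,y_i),\, T_A^* \frac{\partial m}{\partial y}(x_i,y_i)\Big), \qquad (A.21)$$

where $\bar{P}_i(\bar{x}_i,\bar{y}_i)$ are the transformed positions of the dots.

The feature functionals for line features have the form

$$F[X] = \sum_{L_i \subset S} \int_{L_i} f\Big(x,\, y,\, n_1,\, n_2,\, m(x,y),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,ds. \qquad (A.22)$$

By Definition 2 and the rules of change of variables, the action
of the conjugate transformation operator $T_A^*$ is given by

$$T_A^* F[X] = F[T_A X]$$

$$= \sum_{\bar{L}_i \subset \bar{S}} \int_{\bar{L}_i} f\Big(\bar{x},\, \bar{y},\, \bar{n}_1,\, \bar{n}_2,\, m(\bar{x},\bar{y}),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,d\bar{s}$$

$$= \sum_{L_i \subset S} \int_{L_i} f\Big(\bar{x}(x,y),\, \bar{y}(x,y),\, T_A^* n_1,\, T_A^* n_2,\, T_A^* m(x,y),\, T_A^* \frac{\partial m}{\partial x},\, T_A^* \frac{\partial m}{\partial y}\Big)\,\Gamma(s)\,ds. \qquad (A.23)$$

The feature functionals for area features have the form

$$F[X] = \iint_{S} f\Big(x,\, y,\, m(x,y),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,dx\,dy. \qquad (A.24)$$

The action of the conjugate transformation operator $T_A^*$ is
given by

$$T_A^* F[X] = F[T_A X]$$

$$= \iint_{\bar{S}} f\Big(\bar{x},\, \bar{y},\, m(\bar{x},\bar{y}),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,d\bar{x}\,d\bar{y}$$

$$= \iint_{S} f\Big(\bar{x}(x,y),\, \bar{y}(x,y),\, T_A^* m(x,y),\, T_A^* \frac{\partial m}{\partial x},\, T_A^* \frac{\partial m}{\partial y}\Big)\,J(x,y)\,dx\,dy, \qquad (A.25)$$

where $\bar{S}$ is the transformed region and $J(x,y)$ is the Jacobian
given by the determinant of eqn (A.8).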


APPENDIX  B.  GRAY-LEVEL  IMAGES 

As discussed earlier, gray-level images are not adequate in
general for stable feature detection. However, if the value
X(x,y) of “image intensity” can be assumed to be inherent to
the object and is not influenced by the viewing orientation, for
example if it is chromaticity, we can also use the direct
filtering proposed by Amari [4, 5] in the form

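$$F[X] = \iint_{S} m(x,y)\,X(x,y)\,dx\,dy. \qquad (B.1)$$

If each point $(x,y)$ is displaced by $(\Delta x,\Delta y)$, the difference
$\Delta F[X]$ of the feature values is given by

$$\Delta F[X] = \iint_{S} \Big[ \frac{\partial m}{\partial x}\Delta x + \frac{\partial m}{\partial y}\Delta y + \Big( \frac{\partial \Delta x}{\partial x} + \frac{\partial \Delta y}{\partial y} \Big) m \Big] X(x,y)\,dx\,dy + \cdots, \qquad (B.2)$$

where $\cdots$ again denotes higher order terms in $\Delta x$, $\Delta y$ and
their derivatives. Substituting the equation of optical flow
(2.10), we find

$$C_{u_0}[X] = \iint_{S} \frac{\partial m}{\partial x}\,X\,dx\,dy, \qquad C_{v_0}[X] = \iint_{S} \frac{\partial m}{\partial y}\,X\,dx\,dy,$$

$$C_A[X] = \iint_{S} \Big[ m + x\frac{\partial m}{\partial x} \Big] X\,dx\,dy, \qquad C_B[X] = \iint_{S} y\frac{\partial m}{\partial x}\,X\,dx\,dy,$$

$$C_C[X] = \iint_{S} x\frac{\partial m}{\partial y}\,X\,dx\,dy, \qquad C_D[X] = \iint_{S} \Big[ m + y\frac{\partial m}{\partial y} \Big] X\,dx\,dy, \qquad (B.3)$$

$$C_E[X] = \iint_{S} \Big[ 3xm + x^2\frac{\partial m}{\partial x} + xy\frac{\partial m}{\partial y} \Big] X\,dx\,dy,$$

$$C_F[X] = \iint_{S} \Big[ 3ym + xy\frac{\partial m}{\partial x} + y^2\frac{\partial m}{\partial y} \Big] X\,dx\,dy.$$

If we set $X(x,y) = 1$, all the above reduces to eqns (4.11).

The functionals here defined as area integrals have the
form

$$F[X] = \iint_{S} f\Big(x,\, y,\, m(x,y),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,X(x,y)\,dx\,dy. \qquad (B.4)$$

The action of the conjugate transformation operator $T_A^*$ is
given by

$$T_A^* F[X] = F[T_A X]$$

$$= \iint_{\bar{S}} f\Big(\bar{x},\, \bar{y},\, m(\bar{x},\bar{y}),\, \frac{\partial m}{\partial x},\, \frac{\partial m}{\partial y}\Big)\,\bar{X}(\bar{x},\bar{y})\,d\bar{x}\,d\bar{y}$$

$$= \iint_{S} f\Big(\bar{x}(x,y),\, \bar{y}(x,y),\, T_A^* m(x,y),\, T_A^* \frac{\partial m}{\partial x},\, T_A^* \frac{\partial m}{\partial y}\Big)\,X(x,y)\,J(x,y)\,dx\,dy,$$

where $\bar{S}$ is the transformed region and $J(x,y)$ is the Jacobian
given by the determinant of eqn (A.8).

Thus, if we take six independent functions $m_i(x,y)$,
$i = 1, \ldots, 6$, we obtain the corresponding features in
the form of eqn (B.1), and the corresponding coefficient
functionals, $i = 1, \ldots, 6$, are given by eqns (3.4).

Hence, the motion is recovered by the procedure described in
Section 3.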

FIGURES 

Fig. 1 A point (X,Y,Z) on a plane having equation
z=px+qy+r is projected to point (x,y) on the image
plane by perspective projection from the viewpoint
(0,0,-f). The motion of the plane is specified relative
to a reference point (X0,Y0,Z0) taken on it.




Fig.  2  A  rigid  motion  is  specified  by  translation  (a,b,c)  of  a 
reference  point  and  rotation  R  around  it. 

Fig.  3  Composition  of  two  rigid  motions.  Translations  are 
added,  and  rotations  are  multiplied. 

Fig. 4 Scheme of initial adjustment by size comparison.

Fig.  5  An  example  of  unconstrained  motion. 

Fig.  6  An  example  of  unconstrained  motion. 

Fig.  7  Horizontally  constrained  motion. 

Fig.  8  An  example  of  horizontally  constrained  motion. 

Fig.  9  An  example  of  horizontally  constrained  motion. 

Fig. 10 (a) An image sequence of a moving planar surface
undergoing a horizontally constrained motion. For
each image, random noise of 10% of its size is added
to the corner positions. The true images are indicated
by a dotted line. (b) Computed surface positions
viewed from above (i.e., along the y-axis). A solid
line is used for true positions, a broken line for
positions computed from the noisy images in Fig. 10(a),
and a dotted line for positions computed by adding
a further 10% noise to the initial estimates of the
motion parameters.


[Figs. 1-4: line drawings not recoverable from the scan. Surviving
labels include the plane equations z=px+qy+r and z=p'x+q'y+r', the
reference point (X0,Y0,Z0), the point (X,Y,Z), its image (x,y), and
the rotation R.]


A  Unified  Perspective  on  Computational 
Techniques  for  the  Measurement  of  Visual  Motion* 


P. Anandan†

Computer and Information Science Department
University of Massachusetts
Amherst, MA 01003


ABSTRACT 

A unified hierarchical computational framework is developed for
the determination of dense displacement fields from a pair of
images. This framework incorporates the separation of computations
according to scale, a coarse-to-fine control strategy, the
explicit use of direction-dependent confidence measures, and a
rigorous formulation of a smoothness constraint on image motion.
The recent hierarchical extensions by Glazer and Enkelmann of
the well known gradient-based techniques of Horn and Schunck
and Nagel, and our own matching technique, which is based on
the minimization of the sum-of-squared-differences (SSD) of the
intensities, are all shown to be completely consistent with our
framework. It is also shown that the gradient-based techniques
can be regarded as mathematical limits of the matching
technique. Since our framework allows the explicit identification
of the various components that are necessary for the successful
measurement of motion, the development of a single robust
system which incorporates the best implementation choices for
each component is within reach.

A  modified  version  of  our  framework  is  proposed,  in  which 
the  computations  are  separated  according  to  scale  and  orienta¬ 
tion.  This  new  approach  unifies  the  recently  developed  spatio- 
temporal  energy  models  with  the  gradient-based  and  the  match¬ 
ing  techniques,  and  appears  to  be  biologically  feasible  and  ideally 
suited  for  connectionist  models  of  computation. 

1  Introduction 

1.1  Background 

This paper describes an attempt to develop a unified
computational framework for the measurement of visual
motion from a sequence of images. In computer vision, the
most popular approach for motion analysis has been to
measure image-motion prior to any determination of
geometric structures. These measurements are then used to
recover the 3-dimensional structure of the environment and
to determine the relative 3-D motion between the camera
and the objects in the scene. In this sense, the measurement
of visual motion is treated synonymously with the
measurement of image-motion.

*This research was supported by DARPA under grant
N00014-82-K-0464.

†Current Address: Department of Computer Science, AI Project,
Yale University, New Haven, CT 06520.

The research effort in computer vision for the measurement
of image-motion has focused primarily on (a) the
determination of instantaneous image-velocities by using
a “gradient-based” approach, or (b) the determination of
displacements of points between successive frames using
a matching approach. On the other hand, recent developments
in psychophysics have focused on the determination
of the speed and the direction of motion of a
one-dimensional visual signal.

The research on the interpretation of motion [2,3,4]
indicates that a dense and reliable displacement field may
be necessary for the successful determination of the
structure of the environment. The most effective techniques
for computing dense displacement-fields seem to be those
which embed either a gradient-based or a matching
approach in a hierarchical, multi-resolution scheme. Such
techniques include the multi-resolution gradient-based
approaches of Enkelmann [12] and Glazer [15], and our own
multi-resolution matching technique, which will be denoted
here as Anandan’s technique [6,7].

Although  the  gradient-based  and  the  matching  approaches 
appear  to  be  unrelated  to  each  other,  they  can  be  unified 
under a single computational framework. Such a framework
is described in this paper and shown to be consistent
with  the  different  current  approaches  for  the  measurement 
of  motion.  The  framework  is  developed  from  theoreti¬ 
cal  considerations,  and  each  of  its  major  components  is 
shown  to  be  essential  for  the  computation  of  dense  dis¬ 
placement  fields.  Besides  the  unification  of  a  wide  range 
of  current  techniques,  the  framework  also  allows  the  re¬ 
duction  of  the  task  of  designing  a  working  system  to  that 
of  making  proper  implementation  choices  for  the  various 
components  of  the  framework. 

1.2  Framework  overview 

The  key  idea  underlying  our  framework  is  the  separation 
of  computations  according  to  scale.  This  idea  is  based 
on the following observation: usually, the large scale (or
low spatial-frequency) intensity variations can provide imprecise
measurements over a large range of magnitudes of
motion,  while  the  small-scale  (or  high  spatial-frequency) 
variations  can  provide  more  accurate  measurements  over 
a  smaller  range.  This  leads  to  three  components  of  our 
framework:  spatial-frequency  decomposition,  which  is 
the  method  of  separating  the  intensity  variations  accord¬ 
ing  to  scale,  a  local,  parallel  match-criterion  within  each 
scale,  and  a  control-strategy,  which  is  a  method  for  con¬ 
trolling  the  measurement  processes  at  the  different  scales 
and  combining  their  results. 

Although  the  scale-based  separation  of  computation 
provides  a  useful  principle  for  processing  scenes  containing 
large  displacements,  there  will  always  be  situations  when 
an  image  area  lacks  sufficient  local  information  for  dis¬ 
placement  computation  at  a  particular  scale.  Also,  since 
the  image-displacement  is  a  vector  quantity,  its  reliability 
can  vary  according  to  direction.  Therefore  another  essen¬ 
tial  component  of  our  framework  is  a  direction-dependent 
confidence  measure.  The  presence  of  unreliable  dis¬ 
placements  also  means  that  in  order  to  obtain  a  dense 
displacement  field,  it  may  be  necessary  to  propagate  the 
reliable  displacements  to  their  less  reliable  neighbors.  This 
leads  to  the  last  essential  component  of  our  framework:  a 
smoothness  constraint,  which  specifies  the  criterion  for 
the  propagation  of  reliable  displacements. 

A  visual  illustration  of  this  framework  is  provided  in 
Figure  1.  The  five  major  components  mentioned  in  the 
above  description  are  discussed  in  detail  in  section  2. 

1.3  Paper  overview 

The  detailed  description  of  our  framework  for  computing 
dense  displacement  fields  from  a  pair  of  images  is  contained 
in  section  2.  First,  the  goals  of  the  displacement  computa¬ 
tion  process  are  stated.  This  statement  consists  of  speci¬ 
fying  the  nature  of  the  input,  the  requirements  on  output, 
and  the  computational  constraints  for  this  process.  This 
is  followed  by  the  detailed  descriptions  of  the  five  compo¬ 
nents  of  the  framework. 

Section  3  contains  a  review  of  the  principles  underlying 
current  techniques  in  computer  vision  and  an  outline  of  the 
three  specific  algorithms  of  Enkelmann,  Glazer,  and  Anan- 
dan.  Section  4  explains  the  relationship  of  these  three  tech¬ 
niques  to  our  framework  by  identifying  the  implementation 
choices  for  the  five  components.  In  section  4,  we  also  show 
that  the  two  gradient-based  techniques  can  be  regarded  as 
the  mathematical  limits  of  the  matching  approach.  Sec¬ 
tion  5  contains  a  discussion  of  the  problems  involved  in 
processing  discontinuities  in  image  motion,  while  Section  6 
contains  the  results  of  applying  Anandan’s  matching  tech¬ 
nique  to  a  pair  of  real  images,  as  a  demonstration  of  the 
feasibility  of  the  ideas  contained  in  our  framework. 

As such, the spatio-temporal energy models [1] are not
consistent with the computational framework described
here. This is because the energy models have been formulated
primarily for one-dimensional signals varying continuously
over time. The use of the energy models for
two-dimensional sequences would involve decomposing the
input intensity image according to orientation. Such an
approach can be found in the recent work of Heeger [18].
In section 7, we propose a modified form of our framework,
in which all three categories of techniques are unified.
This new approach appears to be biologically feasible
and naturally suited for connectionist models of computation.
In our own future research, we intend to explore
possible methods of implementing this approach.

[Figure 1: The hierarchical computational framework. Diagram not
recoverable from the scan; surviving labels: confidence, disp.,
SMOOTHING.]

2  The  Computational  Framework 

2.1  The  computational  goals 

The  goals  of  the  process  of  computing  image  displacements 
are  determined  by  three  major  factors:  the  nature  of  its 
input,  the  requirements  on  the  output,  and  computational 
efficiency  constraints.  The  input  is  a  pair  of  digitized 
frames  belonging  to  a  discrete  image  sequence.  The  im¬ 
age  displacements  may  be  due  to  a  general  3-dimensional 
motion  of  the  camera  or  the  independent  motion  of  ob¬ 
jects  in  the  scene.  The  output  should  be  a  dense  field  of 
displacement  vectors  with  associated  confidence  measures. 
All the computations must be pixel-parallel and use local
image information. Our motivations for choosing this set
of goals are explained below.

The  input  -  in  typical  video  sequences,  the  inter-frame 
displacements  are  usually  considerably  larger  than  a  pixel. 
Usually,  due  to  the  presence  of  independently  moving  ob¬ 
jects,  a  single  set  of  3-D  motion  parameters  will  not  be 
consistent  with  the  entire  image.  The  objects  in  the  en¬ 
vironment  are  assumed  to  be  composed  of  continuous  and 
opaque  surfaces.  It  is  also  assumed  that  (i)  within  the 
image-area  covered  by  a  single  surface,  the  displacement 
field  varies  smoothly,  and  (ii)  the  image-motion  can  be  de¬ 
scribed  as  “locally  translational”,  i.e.,  within  a  small  area 
of  the  image,  the  displacement  field  can  be  approximated 
by  a  translational  flow  field.  Discontinuities  in  image  mo¬ 
tion  will  occur  at  the  boundaries  of  surfaces  and  objects. 

The  assumptions  stated  above  are  satisfied  in  a  large 
class  of  real  images.  The  major  exceptions  arise  when  the 
images  contain  transparent  and/or  “fence-like”  surfaces, 
when  the  points  on  an  object  undergo  non-rigid  or  chaotic 
motion,  and  when  the  amount  of  rotation  is  significant. 

The output - The requirement that the output should
be a dense displacement field with a confidence measure
is derived from the conclusions of the various studies
concerning the problem of extracting structure from motion
[2,3,13,31,37]. These studies indicate that a large number
of image displacements are necessary for the accurate
determination of the structure of the environment. In
addition, there should be an indication of the reliability of
the displacements, so that inaccurate ones can be ignored.
Adiv’s approach [2], which appears to be at least partially
successful at segmenting independently moving objects,
assumes no a priori knowledge of the image-locations of such
objects. Therefore, the density of the displacement vector
field should be large uniformly across the image. These
conclusions are further supported by his analysis of the
inherent ambiguities in the interpretation of motion [3].

Computational  considerations  -  The  considerations 
of  computational  efficiency  and  ease  of  implementation 
suggest  that  the  following  three  properties  are  desirable 
for  all  of  our  computations:  parallelism,  uniformity,  and 
localness. 

Parallelism  simply  means  that  it  should  be  possible  to 
perform  all  computations  simultaneously  at  all  locations 
on  the  image-plane.  Uniformity  implies  that  the  process 
should  be  similar  at  all  locations.  It  should  be  possible 
to  describe  any  differences  between  the  computations  at 
different  locations  in  terms  of  a  few  simple  parameters. 
Localness  means  that  the  computations  at  any  point  on 
the  image  should  be  based  on  information  local  to  that 
point. 


2.2  Spatial-frequency  decomposition 

As  noted  briefly  at  the  beginning  of  this  paper,  the  key  idea 
underlying  our  computational  strategy  is  the  separation  of 
computation  on  the  basis  of  scale.  Intuitively,  it  is  clear 
that  while  small  scale  intensity  structures  can  be  used  to 
measure  displacements  over  a  short  range,  they  may  have 
many  duplicate  matches  over  a  large  range.  This  leads 
to  ambiguities  in  the  computation  of  the  displacements. 
Therefore,  in  order  to  process  large  displacements,  large 
scale  intensity  information  must  be  used.  However,  a  sin¬ 
gle  displacement  computed  on  the  basis  of  a  large  scale 
intensity  structure  will  be  an  average  of  the  displacements 
over  the  area  covered  by  that  structure;  hence,  its  accu¬ 
racy  will  be  low.  Such  a  “smoothed”  displacement  field 
will  also  vary  slowly  over  the  image  plane;  hence,  it  can  be 
sampled  at  a  lower  rate  without  loss  of  information. 

These  observations  lead  to  the  following  principle:  large- 
scale  image  structures  can  be  used  to  measure  displace¬ 
ments  over  a  large  range  with  low  accuracy  and  at  a  low 
sampling  density,  while  small-scale  image  structures  can 
be  used  to  measure  displacements  over  a  short  range  with 
higher  accuracy  and  at  a  higher  sampling  density.  An  obvi¬ 
ous  way  to  enforce  this  principle  is  to  decompose  the  image 
into  its  spatial-frequency  components.  Such  a  decompo¬ 
sition  and  the  subsequent  processing  can  be  achieved  by 
using  a  set  of  spatial-frequency  channels. 

Since the lower-frequency information can be sampled
at a lower rate without any significant loss of information,
the spatial-frequency decomposition process is usually
accompanied by a corresponding reduction of resolution
[9,38]. Such an approach leads to a pyramid representation
of the spatial-frequency channels and fits naturally
into a pyramid [21,32] or a processing-cone [17] architecture.
However, since the final choice of a representation
scheme depends on the type of hardware which is used,
a pyramid representation is not an essential part of our
framework.
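
A minimal sketch of such a resolution-reducing decomposition is given
below. It is an illustration under stated assumptions, not the
implementation used in this paper: the 5-tap binomial kernel, the
border replication, and the names blur_rows, reduce_level and
build_pyramid are all choices made for the example. Each level is a
low-pass filtered copy of the previous one sampled at half the rate; a
band-pass channel could be formed from the difference between adjacent
levels.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # assumed 5-tap binomial low-pass filter

def blur_rows(image):
    """Convolve every row with KERNEL, replicating the border samples."""
    width = len(image[0])
    return [
        [
            sum(w * row[min(max(j + k - 2, 0), width - 1)]
                for k, w in enumerate(KERNEL))
            for j in range(width)
        ]
        for row in image
    ]

def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(column) for column in zip(*image)]

def reduce_level(image):
    """Low-pass filter in both directions, then keep every second row and column."""
    blurred = transpose(blur_rows(transpose(blur_rows(image))))
    return [row[::2] for row in blurred[::2]]

def build_pyramid(image, levels):
    """Return [finest, ..., coarsest]; each level halves the resolution."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid

# A 16x16 test pattern (hypothetical data) reduced to 8x8 and 4x4.
image = [[float((r // 2 + c // 2) % 2) for c in range(16)] for r in range(16)]
pyramid = build_pyramid(image, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
print(sizes)
```

Because the kernel weights sum to one, a uniform image keeps its value
at every level; only the spatial variations are progressively removed.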

2.3  The  match-criterion 

As noted earlier, the match-criterion is a method for
determining the displacements within each channel. Since the
displacement measured is small with respect to the scale
of the intensity variations within a channel, a
gradient-based approach can be used (see [12,15]). Alternatively,
a correlation-matching approach [6,10,14] or a symbolic
matching approach based on primitive tokens [16,23] can
also be used. The separation of matching according to
scale implies that the match-criterion should have a scaling
property, i.e., the measurement processes within different
channels should be scaled versions of each other. Note that
such a scaling property is directly provided in a pyramid
representation.




2.4  The  control-strategy 

The control-strategy determines how the measurement
processes at different scales are controlled and how their
results are combined. In our framework, the control strategy
is based on a spectral continuity principle [16,24], which can
be described as follows: For images of opaque surfaces, it
can be assumed that the displacement estimates at
corresponding image locations in the different channels must be
similar because they are due to relative motion between
the camera and the same environmental area. This means
that at any image location, a displacement computed from
a high-frequency channel must be consistent with the
estimates from the low-frequency channel at the corresponding
image location.

A simple way of enforcing the spectral continuity principle
is by a “coarse-to-fine” control strategy. In this strategy,
the processing proceeds from the low- to the high-frequency
channels. The displacement estimate for a pixel in a
low-frequency channel determines the center of the search
area for the corresponding pixels in the next higher
frequency  channel.  The  scale-invariance  property  of  the 
measurement  process  suggests  that  the  radius  of  the  search 
areas  between  two  adjacent  channels  should  be  propor¬ 
tional  to  the  scale  factor  in  order  to  ensure  scale  invariance 
of  the  computations.  Once  again,  note  that  such  scaling  is 
automatically  achieved  in  the  pyramid  representation. 
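
The coarse-to-fine step described above can be sketched as follows. This is only an illustration under stated assumptions: a scale factor of 2 between adjacent pyramid levels, a fixed search radius measured in pixels of the current level, and the hypothetical function name `search_candidates`. Because the radius is fixed in level-local pixels, it automatically covers twice the image distance at the next coarser level.

```python
def search_candidates(coarse_disp, radius=1, scale=2):
    """Candidate displacements at the finer level for one pixel.

    coarse_disp: (dx, dy) estimated at the coarser level, in coarse pixels.
    scale:       assumed ratio between adjacent levels (2 for a pyramid).
    radius:      assumed search radius, in pixels of the finer level.
    """
    # The coarse estimate, re-expressed in fine-level pixels,
    # becomes the center of the search area.
    cx, cy = scale * coarse_disp[0], scale * coarse_disp[1]
    return [(cx + dx, cy + dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


candidates = search_candidates((1, -2))
```

With the defaults, a coarse estimate of (1, -2) yields a 3x3 block of candidates centered on (2, -4).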

2.5  The  confidence  measure 

In  general,  there  will  be  areas  of  the  image  with  insuffi¬ 
cient  information  at  a  particular  scale  for  the  local  deter¬ 
mination  of  displacements.  Therefore  a  confidence  mea¬ 
sure  should  be  computed  along  with  each  match  at  each 
scale  to  indicate  whether  or  not  to  accept  that  match  for 
further  processing. 

Since  the  image  displacement  is  a  vector  quantity,  it  is 
possible  that  different  directional  components  of  the  dis¬ 
placements  may  be  locally  computable  with  different  de¬ 
grees  of  reliability.  For  instance,  it  is  intuitively  clear  that 
in  a  homogeneous  area  of  the  image  no  component  of  the 
displacement  can  be  reliably  estimated.  At  a  point  along 
a  line  (or  an  edge),  the  component  perpendicular  to  the 
line  can  be  reliably  computed,  while  the  component  par¬ 
allel  to  the  line  may  be  ambiguous.  Finally,  at  a  point  of 
high  curvature  along  an  image-contour  it  may  be  possible 
to  completely  and  reliably  determine  the  displacement  vec¬ 
tor  based  on  local  information.  These  observations  suggest 
that  the  confidence  measure  should  be  directionally  selec¬ 
tive,  i.e.,  that  it  should  associate  different  confidences  with 
the  different  directional  components  of  the  displacement 
vector.  In  addition,  while  an  area  may  be  homogeneous 
at  one  scale,  it  may  have  information  useful  for  reliable 
matching  at  a  different  scale.  Hence,  the  confidence  mea¬ 
sures  should  be  separately  computed  within  each  spatial- 
frequency  channel. 


2.6  Smoothness  constraint 

The  computation  of  a  dense  displacement  field  may  re¬ 
quire  “filling  in”  areas  with  unreliable  displacements  based 
on  the  reliable  displacements  in  their  neighborhood.  Such 
a  filling  in  process  can  be  based  on  the  assumption  that 
the  displacement  field  varies  smoothly  over  the  image  area 
covered  by  a  single  surface. 

The  smoothness  assumption  is  violated  at  locations  of 
discontinuities  in  the  image  motion.  Such  discontinuities 
arise  at  surface  boundaries  of  a  single  object,  or  at  object 
boundaries  due  to  independent  movement  of  two  different 
objects. Any scheme that uses the smoothness assumption
should also include mechanisms for detecting and processing
such violations. While some techniques (e.g., see [6])
have  methods  for  processing  the  known  violations  of  the 
smoothness  constraint,  the  detection  of  discontinuities  in 
image  motion  remains  a  major  unsolved  problem. 

The  most  common  use  of  the  smoothness  assumption 
can  be  found  in  gradient-based  techniques  for  determining 
image-velocities.  In  these  techniques,  a  smoothness  con¬ 
straint  is  formulated  as  a  variational  problem  involving  the 
minimization  of  an  error  associated  with  a  velocity  field.  In 
our  framework,  we  suggest  the  use  of  a  similar  smoothness 
constraint  within  each  channel.  After  the  displacements 
and  the  associated  confidences  are  computed  within  each 
channel,  the  displacement  field  should  be  smoothed  before 
it  is  projected  to  the  next  higher  frequency  channel.  Dur¬ 
ing  the  smoothing  process,  the  confidence  measures  should 
be  used  to  retain  the  reliable  displacements  while  allowing 
the  less  reliable  estimates  to  change. 
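
As an illustration of confidence-weighted smoothing, the following minimal sketch treats a one-dimensional scalar displacement field. The function name `smooth_field`, the weight `alpha`, the Gauss-Seidel relaxation, and the scalar (non-directional) confidences are all assumptions made for brevity; the framework itself calls for directionally selective confidences on a two-dimensional field. The sketch minimizes sum_i c_i (u_i - d_i)^2 + alpha^2 sum_i (u_{i+1} - u_i)^2.

```python
def smooth_field(data, conf, alpha=1.0, iterations=200):
    """Confidence-weighted smoothing of a 1-D displacement field.

    data: measured displacements d_i.
    conf: non-negative confidences c_i (0 means "unreliable").
    Returns the smoothed field u after Gauss-Seidel relaxation.
    """
    u = list(data)
    n = len(u)
    a2 = alpha * alpha
    for _ in range(iterations):
        for i in range(n):
            nbrs = [u[j] for j in (i - 1, i + 1) if 0 <= j < n]
            denom = conf[i] + a2 * len(nbrs)
            if denom > 0:
                # Setting dE/du_i = 0 gives a confidence-weighted average
                # of the local measurement and the neighboring values.
                u[i] = (conf[i] * data[i] + a2 * sum(nbrs)) / denom
    return u


# Reliable end points, unreliable interior: the interior is "filled in".
u = smooth_field([1.0, 0.0, 0.0, 0.0, 3.0], [10.0, 0.0, 0.0, 0.0, 10.0])
```

In this usage example the two high-confidence end values stay close to their measurements while the zero-confidence interior is interpolated between them.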

3  The Major Approaches in Computer Vision for the Measurement of Motion

As noted in the introductory section, the current computer
vision techniques for the measurement of motion can be
divided into the gradient-based techniques and matching
techniques. In this section, the general principles underlying
the computations in each of these two categories of techniques
are described. The two hierarchical gradient-based
algorithms of Enkelmann and Glazer and the hierarchical
matching algorithm of Anandan are outlined. Detailed
reviews of several other techniques that belong to each
category can be found in [7].

3.1  The  gradient-based  approaches 

Almost all the techniques for measuring the instantaneous
image-velocities use the gradient-based approach. This
approach is based on the assumption that the intensity of
light  reflected  by  a  point  on  an  environmental  surface  and 
recorded  in  the  image  remains  constant  during  a  short  time 
interval,  although  the  location  of  the  image  of  that  point 




may  change  due  to  motion.  This  assumption  leads  to  the 
following  equation,  which  is  called  the  intensity  constraint: 

$$|\nabla I|\, U_\perp = -I_t \qquad (1)$$

where $|\nabla I|$ is the magnitude of the intensity gradient-vector
$\nabla I = (I_x, I_y)$, $I_t$ is the temporal derivative of the
intensity function, and $U_\perp$ is the component of the image
velocity¹ $U$ parallel to $\nabla I$. The other component $U_T$, along
the direction perpendicular to $\nabla I$, is unspecified by this
constraint. Since the orientation of the intensity gradient
vector is normal to the direction of the “edge” at a point,
$U_\perp$ and $U_T$ are respectively called the “normal-flow” and
the “tangential-flow” components of the edge. The lack
of information regarding the tangential-flow component is
known as the “aperture problem”.
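
To make the intensity constraint concrete, here is a small sketch that recovers the normal-flow vector at one pixel from the three intensity derivatives. The function name `normal_flow` and the convention of returning `None` where the gradient vanishes are illustrative choices, not part of the original formulation.

```python
import math


def normal_flow(ix, iy, it):
    """Normal-flow vector at a pixel from equation (1).

    ix, iy: spatial intensity derivatives (the gradient).
    it:     temporal intensity derivative.
    Returns the velocity component along the gradient as (ux, uy),
    or None in a homogeneous area where the gradient is zero.
    """
    mag = math.hypot(ix, iy)
    if mag == 0.0:
        return None  # no component of the velocity is constrained
    speed = -it / mag  # signed magnitude of the normal flow
    return (speed * ix / mag, speed * iy / mag)


flow = normal_flow(3.0, 4.0, -10.0)
```

For a gradient of (3, 4) and a temporal derivative of -10, the normal flow has magnitude 2 and points along the gradient; the tangential component remains undetermined, which is the aperture problem.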

Since the intensity constraint specifies only one component
of the image-velocity at a point, an additional constraint
is necessary to completely determine the velocity
vector, usually given in the form of a smoothness constraint
on the velocity field [20,19,27,29]. For this paper,
Horn and Schunck’s formulation [20] and Nagel’s formulation
[27,29] are of particular interest, because they are
respectively used in the multi-resolution gradient-based
algorithms of Glazer and Enkelmann. These two formulations
of the smoothness constraints are discussed in detail
within the descriptions of the hierarchical techniques given
below. It should be noted that both techniques apply the
intensity constraint for the computation of displacement.
Such an approximation of velocities by displacements is
reasonable because of the hierarchical processing schemes
used, in which all displacement measurements are small
compared to the scale of image-intensity variations.

Enkelmann’s  approach 

Enkelmann  [12]  uses  the  low-pass  pyramid  transform 
described  by  Crowley  and  Stern  [11]  to  create  a  set  of 
Gaussian  low-pass  filters.  After  the  construction  of  a  low- 
pass  pyramid  from  each  image,  the  processing  begins  at 
a  particular  coarse  level;  the  description  of  his  technique 
does  not  specify  how  this  level  is  chosen.  At  the  coars¬ 
est  level,  the  initial  displacement  field  consists  of  vectors 
of  zero  length.  At  all  other  levels  the  initial  displacement 
field  is  determined  by  projecting  the  field  computed  at  the 
adjacent  coarser  level.  The  projection  process  involves  a  bi¬ 
linear  interpolation  of  the  displacement  vectors  in  a  small 
neighborhood  of  the  field  at  the  coarse  level. 

Within each level, the process of refining the initial
displacements is based on Nagel’s gradient-based approach
[27,29]. In this approach, an area around each pixel in
the image is shifted according to the initial displacement
vector at that pixel. The refinement to the initial
displacement field is computed by minimizing the functional
$E = E_{int} + \alpha^2 E_{sm}$, where $E_{int}$ is a formulation of the
intensity constraint mentioned above, and $E_{sm}$ represents the
smoothness assumption and measures the spatial variation
of the displacement field. Mathematically,

$$E_{int} = \iint dx\,dy\, \bigl( I(x,y) - J(x+u,\, y+v) \bigr)^2 \qquad (2)$$

and

$$E_{sm} = \iint dx\,dy\ \mathrm{trace}\left[ (\nabla U^T)^T\, W\, (\nabla U^T) \right], \qquad (3)$$

where $I$ is the intensity function of the shifted first image,
$J$ is the intensity function of the second image, and $W$ is
a weight matrix which depends on the spatial derivatives
of the image intensity function $I$. The inclusion of $W$ has
the effect of allowing the free propagation of the smoothness
assumption along image-contours, while restricting its
propagation across the contours.

¹ In computer vision, the convention has been to denote the image-
velocity by 0 . However, we wish to reserve that symbol for the image-
displacements between successive frames. Hence, we have used U to
denote the velocity.

By  using  the  Euler-Lagrange  equations,  and  ignoring 
the  second-order  terms  of  (u,v),  as  well  as  the  third  and 
higher  order  spatial-derivatives  of  the  intensity  function, 
the  functional  minimization  problem  is  transformed  to  that 
of  solving  the  following  differential  equations: 


$$A U + b - \alpha^2 \begin{pmatrix} \mathrm{trace}\{ W\, \nabla\nabla u \} \\ \mathrm{trace}\{ W\, \nabla\nabla v \} \end{pmatrix} = 0, \qquad (4)$$

where

$$A = (\nabla I)(\nabla I)^T + x^2 (\nabla\nabla I)(\nabla\nabla I)^T, \qquad (5)$$

and

$$b = \Delta I\, (\nabla I) + x^2 (\nabla\nabla I)(\nabla \Delta I). \qquad (6)$$

The $\nabla\nabla$ operator represents the matrix of second derivatives,
e.g.,

$$\nabla\nabla I = \begin{pmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{pmatrix},$$

and $x^2$ denotes the size of a small image-window around
a point $(x,y)$ which represents that point. Note that in
the limit, when the time interval between the two frames
tends to zero, the displacements in the above differential
equation can be replaced by the corresponding image-velocities,
provided $\Delta I$ is replaced by $I_t$, the temporal
intensity derivative.

By  using  the  finite-difference  approach,  the  differential 
equations  are  further  transformed  into  a  large  sparse  sys¬ 
tem  of  linear  equations.  These  linear  equations  are  then 
solved  using  a  multi-resolution  relaxation  approach.  The 
details  of  the  relaxation  process  are  not  relevant  for  the 
purposes  of  this  paper,  although  they  may  be  important 
for  an  efficient  implementation. 

Glazer’s  approach 


Glazer also uses a Gaussian low-pass pyramid representation
of the input images and employs a hierarchical
version of the Horn and Schunck approach. However, the
exact algorithm for the construction of the pyramid is
different; Glazer uses Burt’s Gaussian-pyramid transformation
described in [9]. After the construction of the low-pass
pyramids, the processing begins at a coarse level at
which the magnitudes of the displacements are expected
to be less than a pixel. A coarse-to-fine control-strategy is
used; the projection of the displacements between adjacent
levels is according to the quad-tree connectivity, wherein
each pixel at a coarse level is regarded as the “parent” of
four pixels at the adjacent finer level. Each “child” uses the
displacement of the parent pixel as its initial displacement.
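
The quad-tree projection can be sketched in a few lines. The function name `project_quadtree` is hypothetical, and the doubling of the displacement when moving to the finer level is an assumption that follows from measuring displacements in pixels of the current level with a factor of 2 between levels.

```python
def project_quadtree(coarse, scale=2):
    """Project a coarse displacement field to the next finer level.

    coarse: 2-D list (rows) of (dx, dy) tuples at the coarse level.
    Each parent pixel passes its displacement, rescaled to fine-level
    pixel units (an assumption), to its four children.
    """
    fine = []
    for row in coarse:
        fine_row = []
        for dx, dy in row:
            child = (scale * dx, scale * dy)
            fine_row.extend([child, child])  # two children per row
        fine.append(fine_row)
        fine.append(list(fine_row))          # two rows per parent row
    return fine


fine = project_quadtree([[(1, 0), (0, -1)]])
```

A 1x2 coarse field becomes a 2x4 fine field in which each block of four children shares the (rescaled) displacement of its parent.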

As in Enkelmann’s approach, a window around each
pixel in the first image is shifted according to the initial
displacement at that pixel. The refinement process
also consists of minimizing the sum of two functionals $E_{int}$
and $E_{sm}$, which represent the intensity constraint and the
smoothness assumption. Glazer defines the two errors as

$$E_{int} = \iint dx\,dy\ |\nabla I|^2 (U_\perp - V_\perp)^2, \qquad (7)$$

and

$$E_{sm} = \iint dx\,dy\ (\nabla U^T)^T (\nabla U^T). \qquad (8)$$

As before, $U_\perp$ is the component of the displacement vector
$U$ parallel to the intensity-gradient vector $\nabla I$, and
$V_\perp = -\Delta I / |\nabla I|$. Once again, replacing $\Delta I$ by $I_t$
leads to a similar equation involving the image-velocity $U$.

Glazer also uses the Euler-Lagrange equations to transform
this problem into a set of differential equations, and
obtains a system of linear equations by using the finite-difference
approach. The set of differential equations he
obtains is

$$\nabla I \left[ (\nabla I)^T U + \Delta I \right] - \alpha^2 \begin{pmatrix} \mathrm{trace}\{ \nabla\nabla u \} \\ \mathrm{trace}\{ \nabla\nabla v \} \end{pmatrix} = 0, \qquad (9)$$

where the operator $\nabla\nabla$ is as defined above.

He also uses a multi-resolution relaxation process to
solve his system of equations, although his approach is
more complex than Enkelmann’s method and is based on
recent theoretical work concerning general multi-level
relaxation techniques. It involves dynamic switching between
the levels (up and down) according to the current
rate of convergence during the course of the process. The
hierarchical gradient-based approach and multi-level relaxation
are both described in detail in [15].

3.2  The  matching  approaches 

Whereas the gradient-based approaches use the continuous
variations of the intensity over time, the matching
approaches are formulated for determining the displacements
of image-events from two discrete frames. The type
of matching approach is characterized by the choice of
the “image-event” which is matched and the criterion for
matching that event. These techniques can be broadly
classified as correlation-based matching approaches and
symbolic token matching approaches.

Correlation-based  matching  -  Correlation-based  match¬ 
ing  involves  representing  an  image  pixel  by  a  template  win¬ 
dow  centered  at  that  pixel.  The  “match-measure”  between 
two  pixels  is  computed  by  comparing  the  intensities  at 
the  corresponding  positions  of  the  two  template  windows. 
The  most  common  match-measures  are  direct  correlation, 
mean  normalized  correlation,  variance  normalized  correla¬ 
tion,  and  the  sum-of-squared-differences.  In  some  cases, 
the match measure may be a weighted sum of the individual
pixel comparisons. Usually, the weights are chosen to
increase the contribution of the pixels near the center of
the window. The “best-match”
of  a  pixel  is  determined  by  optimizing  the  match-measure 
over  a  set  of  candidate-match  pixels. 
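
As a concrete instance of correlation-based matching, the sketch below uses the unweighted sum-of-squared-differences as the match-measure and picks the best match from a list of candidates. The names `ssd` and `best_match` are hypothetical, images are plain nested lists indexed as `img[y][x]`, and all windows are assumed to lie inside the image (no bounds checking).

```python
def ssd(img1, img2, p, q, half=1):
    """Sum of squared differences between the window centered at p in
    img1 and the window centered at q in img2 (window side 2*half+1)."""
    (x1, y1), (x2, y2) = p, q
    total = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = img1[y1 + dy][x1 + dx] - img2[y2 + dy][x2 + dx]
            total += diff * diff
    return total


def best_match(img1, img2, p, candidates, half=1):
    """Candidate pixel in img2 that minimizes the SSD match-measure."""
    return min(candidates, key=lambda q: ssd(img1, img2, p, q, half))


# A single bright pixel that moves one pixel to the right.
img1 = [[0.0] * 7 for _ in range(7)]
img2 = [[0.0] * 7 for _ in range(7)]
img1[3][3] = 10.0
img2[3][4] = 10.0
candidates = [(x, y) for y in range(2, 5) for x in range(2, 5)]
match = best_match(img1, img2, (3, 3), candidates)
```

Here the best match for pixel (3, 3) is (4, 3), i.e. a displacement of one pixel to the right. A weighted variant would multiply each squared difference by a window weight before summing.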

Symbolic-token  matching  -  Symbolic  matching  tech¬ 
niques  use  a  symbolic  representation  of  geometric  struc¬ 
tures  in  the  image  as  the  basis  for  matching.  Such  an 
approach  is  motivated  by  the  view  that  geometric  struc¬ 
tures  in  the  image  often  correspond  to  “interesting"  phys¬ 
ical  structures  in  the  3D  environment.  These  structures 
are  interesting  because  they  may  be  easily  distinguishable 
from  other  such  structures  and  are  likely  to  be  stable  over 
several  image  frames.  In  practice,  however,  the  extraction 
of  useful  symbolic  structures  is  difficult,  and  attention  is 
focused  on  interesting  image  points  and  edges. 

The most successful matching approaches for computing
image-displacements use a hierarchical matching approach
[6,10,14,22,23,26,38]. In most of these techniques,
the hierarchical approach is chosen for the same reasons
as those used in the development of our framework. Of
these, the one that is most consistent with our framework
is Anandan’s technique. This technique is described below.
For a detailed discussion of the relationship of each
of the hierarchical matching techniques to our framework,
refer to [7].

Anandan’s  hierarchical  matching  technique 

Anandan  uses  Burt’s  Laplacian  pyramid  transforma¬ 
tion  [9]  to  create  a  set  of  difference-of-Gaussian  filters.  This 
process  involves  computing  the  difference  between  the  im¬ 
ages  at  each  pair  of  adjacent  levels  of  Burt’s  Gaussian  pyra¬ 
mid.  The  Laplacian  pyramid  provides  a  set  of  band-pass 
filters  rather  than  a  set  of  low-pass  filters.  The  motivation 
for  choosing  band-pass  filters  was  based  on  empirical  stud¬ 
ies  [8]  which  indicate  that  correlation  matching  is  more 
reliable  when  such  filters  are  used. 
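
The construction of band-pass levels as differences between adjacent low-pass levels can be illustrated in one dimension. This is a simplified sketch rather than Burt’s transformation itself: the 5-tap binomial kernel, the clamped borders, and the nearest-neighbor expansion are all assumptions, and the function names are hypothetical.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # assumed low-pass kernel


def blur(signal):
    """Low-pass filter with clamped (replicated) borders."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def reduce_level(signal):
    """One low-pass pyramid step: blur, then keep every other sample."""
    return blur(signal)[::2]


def expand(signal, length):
    """Nearest-neighbor expansion back to the finer sampling (simplified)."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]


def laplacian_pyramid(signal, levels):
    """Band-pass levels (fine to coarse) plus the final low-pass residual."""
    gauss = [list(signal)]
    for _ in range(levels - 1):
        gauss.append(reduce_level(gauss[-1]))
    bands = []
    for fine, coarse in zip(gauss, gauss[1:]):
        up = expand(coarse, len(fine))
        bands.append([f - u for f, u in zip(fine, up)])
    bands.append(gauss[-1])
    return bands


bands = laplacian_pyramid([0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0], 3)
```

Each band holds the detail that is lost when going from one low-pass level to the next coarser one, so adding the expanded coarser levels back recovers the input.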

As in Glazer’s algorithm, the process begins at a coarse
level at which the magnitudes of the displacements are
less than a pixel. A coarse-to-fine control strategy is also
used, although the method of transferring displacements
to a finer level is somewhat different from that used by
Glazer and Enkelmann. In this method, which is called
the overlapped pyramid projection scheme, each child pixel
has four parents and can therefore have up to four distinct
initial displacements. The search area for the best match
consists of the union of all the pixels in the four 3x3 pixel
areas surrounding the four initial match estimates. For
details, see [5,7].
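
The search area of the overlapped projection scheme can be sketched as a set union. Which four coarse pixels act as parents is specified in [5,7] and is not reproduced here, so the sketch simply takes the parents’ displacements as input; the name `search_area` and the factor-of-2 rescaling of displacements between levels are assumptions.

```python
def search_area(child, parent_disps, scale=2, radius=1):
    """Union of the 3x3 areas around each parent's projected match.

    child:        (x, y) position of the child pixel at the finer level.
    parent_disps: displacements (dx, dy) of its parents, in coarse pixels.
    Returns the set of candidate match positions at the finer level.
    """
    x, y = child
    area = set()
    for dx, dy in parent_disps:
        cx, cy = x + scale * dx, y + scale * dy  # initial match estimate
        for oy in range(-radius, radius + 1):
            for ox in range(-radius, radius + 1):
                area.add((cx + ox, cy + oy))
    return area


area = search_area((10, 10), [(0, 0), (0, 0), (1, 0), (1, 0)])
```

When all parents agree the area collapses to a single 3x3 block; when they disagree, as near a motion boundary, the child can choose among matches consistent with either side.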

The  match-criterion  used  is  the  minimization  of  Gaussian- 
weighted  sum-of-squared-differences  (SSD)  with  5x5  pixel 
template  windows.  A  search  process  was  used  to  deter¬ 
mine  the  best  match  pixel  contained  in  the  search  area  de¬ 
fined  above.  Recognizing  the  fact  that  a  match  so  obtained 
may  not  always  be  unique,  Anandan  computes  a  direction- 
dependent  confidence  measure  associated  with  the  best 
match. 

The confidence measure is derived from the distribution
of the SSD values around the best match estimate.
Two confidence values $c_{max}$ and $c_{min}$, which are proportional
to the two principal curvatures of the SSD surface,
are associated with the displacement components along the
directions of the two principal axes of that surface. The
unit-vectors along the principal axes are denoted as $e_{max}$
and $e_{min}$. It has been shown that these measures distinguish
between corners, edges, and homogeneous image areas
[6,7].
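
One way to picture these measures is to estimate the principal curvatures of the SSD surface from the 3x3 block of SSD values around the best match. The sketch below is an assumption-laden illustration, not the definition used in [6,7]: it takes the eigenvalues and eigenvectors of a finite-difference Hessian as stand-ins for the two confidences and their directions, and the name `principal_curvatures` is hypothetical.

```python
import math


def principal_curvatures(s):
    """Principal curvatures and axes of a 3x3 SSD neighborhood.

    s: 3x3 nested list of SSD values, s[row][col], with the best match
       at s[1][1]. Returns (c_max, c_min, e_max, e_min).
    """
    # Central finite differences for the second derivatives.
    sxx = s[1][0] - 2 * s[1][1] + s[1][2]
    syy = s[0][1] - 2 * s[1][1] + s[2][1]
    sxy = (s[2][2] - s[2][0] - s[0][2] + s[0][0]) / 4.0
    # Closed-form eigen-decomposition of the symmetric 2x2 Hessian.
    mean = (sxx + syy) / 2.0
    dev = math.hypot((sxx - syy) / 2.0, sxy)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    e_max = (math.cos(theta), math.sin(theta))
    e_min = (-math.sin(theta), math.cos(theta))
    return mean + dev, mean - dev, e_max, e_min


# SSD surface that varies only horizontally, as at a vertical edge.
edge = [[(c - 1) ** 2 for c in range(3)] for _ in range(3)]
c_max, c_min, e_max, e_min = principal_curvatures(edge)
```

For the edge-like surface, only the horizontal direction receives a non-zero curvature, matching the expectation that the component perpendicular to an edge is reliable while the parallel one is not.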

A smoothness constraint is formulated as the minimization
of the sum of two functionals $E_{app}$ and $E_{sm}$. Let
$D(x,y)$ denote the displacement obtained for the pixel
located at image position $(x,y)$ by minimizing the SSD
measure as noted above. The functional $E_{app}(U)$ measures
how well a displacement-field $U(x,y)$ approximates
$D(x,y)$, while the functional $E_{sm}(U)$ is identical to the one
used by Glazer. Mathematically,

$$E_{app} = \iint dx\,dy \left[ c_{max} \bigl( (U - D) \cdot e_{max} \bigr)^2 + c_{min} \bigl( (U - D) \cdot e_{min} \bigr)^2 \right]. \qquad (10)$$

Anandan  uses  the  finite-element  approach  to  solve  the 
minimization  problem,  which  also  leads  to  a  relaxation  al¬ 
gorithm.  The  finite-element  method  is  chosen  because  it 
allows  the  inclusion  of  known  discontinuities  in  the  dis¬ 
placement  field;  the  propagation  of  the  smoothness  con¬ 
straint  across  such  discontinuities  can  be  avoided.  Several 
ideas  for  the  detection  of  discontinuities  are  proposed,  al¬ 
though  these  ideas  have  not  been  tested  thoroughly. 

4  Relationship  of  the  Hierarchical 
Approaches  to  Our  Framework 

This  section  explains  the  relationship  of  the  three  hierar¬ 
chical  techniques  to  our  framework.  Our  purpose  here  is  to 
show  that  all  the  components  of  the  framework  are  present 
in  each  of  the  three  techniques,  and  explain  the  particular 
implementation  choices  made  for  those  components. 


From  the  descriptions  given  above,  it  should  be  obvi¬ 
ous  that  all  techniques  use  spatial-frequency  channels,  and 
a coarse-to-fine control strategy. Glazer and Enkelmann
each  use  Gaussian  low-pass  filters,  whereas  Anandan  uses 
band-pass  filters.  From  our  point  of  view,  band-pass  filters 
are  more  appropriate,  because  they  offer  a  greater  separa¬ 
tion  of  the  spatial-frequencies.  The  control-strategy  is  also 
similar,  except  for  the  differences  in  the  projection  scheme. 

As  noted  earlier,  Anandan  uses  the  minimization  of  the 
SSD  as  the  match-criterion,  computes  direction-dependent 
confidence  measures,  and  includes  a  smoothness  constraint 
which  uses  the  confidence  measures.  Hence,  his  technique 
is  completely  consistent  with  our  framework.  On  the  other 
hand,  the  match-criterion,  the  confidence  measures,  and 
the  smoothness  constraint  have  not  been  made  explicit  in 
the  two  gradient-based  techniques.  However,  a  close  anal¬ 
ysis  of  the  mathematical  formulations  of  the  minimization 
problems  reveals  that  these  three  components  have  been 
implicitly  used.  In  this  section,  we  establish  a  one-to- 
one  relationship  between  the  minimization  problems  for¬ 
mulated  in  each  of  the  gradient-based  techniques  and  in 
Anandan’s matching technique.

In all three cases, the smoothness constraint is formulated
as the functional $E_{sm}$. Although Enkelmann’s formulation
appears different from the others because of the
inclusion of the weight matrix $W$, the other two formulations
can be cast in the same form by using the identity
matrix in the place of $W$.

It has been shown in [7] that a correspondence can
be made between the functionals $E_{int}$ used by Glazer and
Enkelmann and the $E_{app}$ used by Anandan. This correspondence
can be summarized as the following theorem:

Theorem  In the limit, when the inter-frame time interval
tends to zero, the formulation of the approximation
error for image displacements used in the discrete matching
approach converges to the second-order formulation of $E_{int}$
for image-velocities used in the gradient-based approach,
provided the third and higher order spatial intensity derivatives
are ignored. Further, when the window-size represented
by $x^2$ tends to zero, $E_{app}$ converges exactly to the
first order gradient-based formulation of $E_{int}$.

The proof of this theorem is too long for this paper, and
can be found in [7]. However, the approach is outlined
below, which also identifies the confidence measures implicitly
used in the gradient-based formulations.

It has been shown by Nagel [28] that if the third and
higher order spatial-derivatives are ignored, the minimization
of the SSD measure is equivalent to solving the equation
$AU = -b$, where $A$ and $b$ are as defined in equations
(5) and (6). This equation is reproduced below:

$$\left( (\nabla I)(\nabla I)^T + x^2 (\nabla\nabla I)(\nabla\nabla I)^T \right) U = -\left( I_t (\nabla I) + x^2 (\nabla\nabla I)(\nabla I_t) \right). \qquad (11)$$



In [7], we have also shown that in this limiting case, the
principal curvatures and the associated principal axes of
the SSD surface are the eigenvalues and the corresponding
eigenvectors of $A$. Since the confidence measures computed
by Anandan are proportional to the principal curvatures,
his approximation error can be rewritten as

$$E_{app} = \iint dx\,dy \left[ \lambda_1 \bigl( (U - D) \cdot e_1 \bigr)^2 + \lambda_2 \bigl( (U - D) \cdot e_2 \bigr)^2 \right], \qquad (12)$$

where $D$ is any solution to equation (11), $\lambda_1$ and $\lambda_2$ are
the two eigenvalues of $A$, and $e_1$ and $e_2$ are the associated
unit eigenvectors.

Equation (12) can be further rewritten as

$$E_{app} = \iint dx\,dy\ (U - D)^T A\, (U - D). \qquad (13)$$

In  [7]  we  also  show  that  the  use  of  Euler-Lagrange  equa¬ 
tions  on  the  functional  Eapp  leads  to  the  differential  equa¬ 
tion  (4).  Hence,  for  all  practical  purposes,  Enkelmann's 
intensity  constraint  given  in  equation  (2)  is  equivalent  to 
the  Eapp  used  by  Anandan. 
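
The step from equation (12) to equation (13) is the spectral decomposition of the symmetric matrix $A$. Written out, with no assumptions beyond $A$ being symmetric with unit eigenvectors $e_1$, $e_2$ and eigenvalues $\lambda_1$, $\lambda_2$:

```latex
\begin{aligned}
A &= \lambda_1\, e_1 e_1^T + \lambda_2\, e_2 e_2^T, \\
(U - D)^T A\, (U - D)
  &= \lambda_1\, (U - D)^T e_1 e_1^T (U - D)
   + \lambda_2\, (U - D)^T e_2 e_2^T (U - D) \\
  &= \lambda_1 \bigl( (U - D) \cdot e_1 \bigr)^2
   + \lambda_2 \bigl( (U - D) \cdot e_2 \bigr)^2 .
\end{aligned}
```

The integrands of (12) and (13) are therefore identical at every image location.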

If in the formulations of $A$ and $b$, the window size $x^2$ is
set to zero, equation (11) reduces to

$$(\nabla I)^T U = -I_t,$$

which is equivalent to the normal-flow equation (1). Moreover,
it can be shown that in this case,

$$\lambda_1 = 0, \quad \lambda_2 = |\nabla I|^2, \quad \text{and} \quad e_2 = e_{\nabla I}. \qquad (14)$$

With these substitutions, the formulation of $E_{app}$ given in
equation (12) reduces exactly to that used by Glazer. From
this, it is clear that Glazer determines the normal-flow
component $V_\perp$ to be $-I_t/|\nabla I|$, and associates a confidence
of $|\nabla I|^2$ with it. The tangential-flow component can be
arbitrarily set to any value with a confidence measure of
zero.
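
The reduction can be checked directly. In the sketch below, $t$ denotes any unit vector perpendicular to $\nabla I$ and $D_\perp$ the component of $D$ along $\nabla I$; both symbols are introduced here only for illustration.

```latex
\begin{aligned}
A &= (\nabla I)(\nabla I)^T
  &&\text{(window size set to zero)} \\
A\, \nabla I &= |\nabla I|^2\, \nabla I, \qquad A\, t = 0
  &&\Rightarrow\ \lambda_2 = |\nabla I|^2,\ \lambda_1 = 0 \\
(\nabla I)^T D &= -I_t
  &&\Rightarrow\ D_\perp = D \cdot e_{\nabla I} = -\frac{I_t}{|\nabla I|} = V_\perp \\
\lambda_1 \bigl( (U - D) \cdot e_1 \bigr)^2 + \lambda_2 \bigl( (U - D) \cdot e_2 \bigr)^2
  &= |\nabla I|^2\, (U_\perp - V_\perp)^2 .
\end{aligned}
```

The last line is the integrand of Glazer’s $E_{int}$, with $I_t$ in place of $\Delta I$ as appropriate for the limiting case.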

The direction-dependent properties of the second-order
equation $AU = -b$, and the behavior of its solution at corners,
edges, and homogeneous areas have recently been
discussed by Nagel in [30]. However, the fact that the
eigenvalues can be regarded as confidence measures of the
displacement-components along the directions of the eigenvectors
has not been mentioned. In particular, the key
equations (13) and (12) have not been derived; instead, in
[29] the differential equation $AU = -b$ has been obtained
in a more indirect manner. The two major advantages of
the explicit identification of the confidence measures are:
(i) the three techniques are unified into a single framework,
and (ii) the confidence measure can also be used outside
the smoothness constraint as in [2].

In  summary,  the  three  most  successful  hierarchical  ap¬ 
proaches  appear  to  be  completely  consistent  with  our  frame¬ 
work.  Moreover,  the  three  approaches  are  essentially  the 
same.  The  distinctions  arise  in  the  exact  formulations  of 
the various components (e.g., the use of the matrix W
in the smoothness constraint, the method of coarse-to-fine
projection, etc.), and in the type of input assumed,
i.e.,  continuous  image-stream  vs.  discrete  image  sequence. 
Since  our  framework  allows  the  explicit  identification  of 
the  various  components,  the  development  of  a  single  ro¬ 
bust  system  which  incorporates  the  best  implementation 
choices  for  each  component  is  within  reach. 

5  Processing Discontinuities in Image Motion

In  this  section,  we  consider  one  of  the  major  unsolved  prob¬ 
lems  in  the  analysis  of  visual  motion  in  relation  to  our  com¬ 
putational  framework.  This  situation  involves  processing 
discontinuities  in  image  motion,  which  are  present  at  the 
boundaries  of  surfaces,  or  at  the  boundaries  of  objects. 
Around  the  locations  of  such  discontinuities,  the  smooth¬ 
ing  involved  in  the  spatial-frequency  decomposition  pro¬ 
cess  creates  intensity  structures  that  have  no  physical  cor¬ 
relates.  Therefore,  the  information  contained  in  the  lower- 
frequency  channels  in  the  two  images  will  be  inconsistent. 
The  local  translational  assumption  and  the  spectral  con¬ 
tinuity  principles  are  also  violated.  Finally,  it  would  be 
inappropriate  to  apply  the  smoothness  constraint  across 
such  boundaries. 

Inconsistencies between the expected behavior of the
matching process and its actual behavior are possible clues
to the detection of such discontinuities. For instance, a
large value for the minimum of the SSD measure could
indicate the presence of occlusion or discontinuities in image
motion. A comparison of the shape of the auto- and
cross-SSD surfaces may also help in the identification of
such discontinuities. A detailed discussion of the behavior
of the SSD surfaces at occluded areas and at locations of
discontinuities in image motion can be found in [7].

6  Demonstrative  Results 

This section demonstrates the feasibility of the ideas contained
in our framework by showing the performance of
Anandan’s matching algorithm on a pair of real images
containing both camera motion and independent object motion.
This algorithm has been tested on more than a dozen
image pairs, and its performance appears to be fairly consistent
[7].

The input images are the two 128 x 128 resolution images
shown in Figure 2. The scene consists of a toy dinosaur,
a toy chicken in the background, and a tea box in
the foreground, all of which rest on a table-top which has a
grid-pattern on it. The 3-D motion between the two frames
consists of a translation of the camera to the right, along
with a leftward rotation about the vertical axis (in order to
bring the scene back into view), as well as an independent
movement of the toy dinosaur. This scene is of interest
for the obvious reason that it contains an independently
moving object besides a complex camera motion.

Figure 2: The input images for the dinosaur-image experiment.

Figure 3: The un-smoothed displacement vector fields at
the finest level of the pyramid. In order to enhance visibility
only a 32 x 32 sample of the displacements has been
shown.

Figure  3  shows  the  displacement  field  at  the  finest  level 
of  the  pyramid  computed  by  the  hierarchical  matching  pro¬ 
cess  without  smoothing,  while  Figure  4  displays  the  dis¬ 
placement  field  at  the  finest  level  produced  by  the  hi¬ 
erarchical  algorithm  with  smoothing.  Figure  5  displays 
the  smoothed  displacement  field  superimposed  on  the  first 
frame.  Figure  6  displays  the  confidence  measure  at  the 
finest  level. 


Figure  4:  The  smoothed  displacement  vector  fields  at  the 
finest  level  of  the  pyramid.  In  order  to  enhance  visibil¬ 
ity  only  a  32  x  32  sample  of  the  displacements  have  been 
shown. 


It  is  evident  from  the  figures  that  the  algorithm  per¬ 
forms  remarkably  well  on  this  real  image  with  complex 
motion.  It  also  appears  that  the  smoothness  constraint 
has  generally  been  useful  in  “filling  in”  at  several  areas 
of  the  image,  e.g.,  the  lower-right  portion  of  the  dinosaur, 
part  of  the  chicken,  and  the  floor. 

The expected behaviors of the confidence measures at
corners, edges, and homogeneous areas are confirmed by
the displays in Figure 6. In order to make these behaviors
more explicit, we attempted to classify the image pixels
according to the following criteria:

•  if $c_{min} > 0.5$, the pixel is classified as a corner,

•  if $c_{max}/c_{min} > 100$ and $c_{max} > 1$, the pixel is classified as
an edge, and

•  if $c_{min} < 0.5$ and $c_{max}/c_{min} < 5$, the pixel is classified as a
point in a homogeneous area.
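
The classification rules above translate directly into a small function. The thresholds 0.5, 100, 1 and 5 are those quoted in the criteria; reading the homogeneous-area test as a condition on the smaller confidence, the guard against division by zero, the "unclassified" label for pixels that satisfy none of the rules, and the name `classify_pixel` are assumptions of this sketch.

```python
def classify_pixel(c_max, c_min, eps=1e-12):
    """Classify a pixel from its two confidence values.

    c_max, c_min: larger and smaller confidence at the pixel.
    eps:          assumed guard to avoid dividing by a zero c_min.
    """
    ratio = c_max / max(c_min, eps)
    if c_min > 0.5:
        return "corner"        # both components reliable
    if ratio > 100 and c_max > 1:
        return "edge"          # one reliable, one unreliable component
    if c_min < 0.5 and ratio < 5:
        return "homogeneous"   # neither component reliable
    return "unclassified"      # assumed label: no rule applies


labels = [classify_pixel(2.0, 1.0),
          classify_pixel(50.0, 0.1),
          classify_pixel(0.2, 0.1),
          classify_pixel(3.0, 0.2)]
```

The four calls give one example of each outcome, in the order corner, edge, homogeneous, unclassified; the last case shows that the three rules do not cover every combination of confidences.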


Figure  7  displays  the  results  of  this  classification. 

Two  areas  of  the  image  of  particular  interest  are  at  the 
boundary  between  the  chicken  and  the  dinosaur,  and  the 
boundary  between  the  tea  box  and  the  floor.  The  first  area 
has  been  maintained  correctly  during  the  smoothing  pro¬ 
cess.  This  is  because  the  confidence  values  are  rather  large 
for  the  vectors  on  either  side  of  this  boundary,  and  this 
prevents  those  vectors  from  changing  during  the  smooth¬ 
ing  process.  However,  the  area  of  the  floor  just  left  of  the 


Figure 5: The smoothed displacement vector field at the
finest level of the pyramid superimposed on the first input
frame. To enhance visibility, only a 32 x 32 sample
of the displacements is shown.


Figure  7:  The  dinosaur-image  experiment:  The  classifi¬ 
cation  of  pixels  as  corners,  edges,  or  homogeneous  areas. 
The  top-left  quadrant  contains  the  input  image.  In  the  im¬ 
ages  shown  in  the  top-right,  bottom-left,  and  bottom-right 
quadrants,  the  corners,  edges,  and  the  homogeneous  areas 
have  been  highlighted. 


Figure 6: The confidence measures at the finest level of the pyramid. The confidence $c_{\max}$ is
displayed on the left as an intensity image, while the confidence $c_{\min}$ is displayed on the right.
The direction vectors are superimposed on the $c_{\max}$ image.




tea  box  has  displacements  that  are  obviously  incorrect,  be¬ 
cause  the  highly  reliable  displacements  at  the  edge  of  the 
tea  box  have  influenced  their  less  reliable  neighbors. 

7  The Modified Hierarchical Framework

7.1  An  overview  of  energy  models 

Recently,  a  number  of  psychophysical  models  [1,35,36]  for 
the  measurement  of  motion  in  the  human  visual  system 
have  been  shown  to  be  closely  related  to  each  other.  These 
have  been  grouped  under  the  title  “spatio-temporal  energy 
models” by Adelson and Bergen [1].

All  of  the  energy  models  have  focused  on  the  problem  of 
determining  the  direction  (and  possibly  the  speed)  of  mo¬ 
tion  of  a  one  dimensional  signal.  Based  on  the  fact  that  the 
movement  of  a  1-D  signal  with  a  sharp  gradient  creates  a 
linear  structure  in  the  spatio-temporal  domain,  the  prob¬ 
lem  of  determining  the  speed  of  the  signal  is  equated  to 
that  of  determining  the  slope  of  the  linear  structure.  As 
explained  by  Adelson  and  Bergen  [1],  the  determination 
of  this  slope  involves  the  construction  of  a  set  of  simple 
linear  spatio-temporal  energy  detectors.  Each  of  these  de¬ 
tectors  can  be  regarded  as  a  simple  linear  filter  which  is 
tuned  to  a  specific  area  of  the  visual  field  (i.e.,  an  area 
of  the  image  plane),  a  specific  range  of  spatial-frequencies, 
and a specific “sign” of motion (i.e., right, left, or stationary).
As is usually the case with simple linear filters,
the  response  of  each  unit  increases  from  zero  to  a  peak 
value  within  a  finite  range  of  locations,  frequencies,  and 
speeds.  It  also  appears  that  an  inverse  relationship  ex¬ 
ists  between  the  spatial-frequency  and  the  speed  to  which 
a  unit  is  tuned,  i.e.,  units  that  are  tuned  to  low  spatial- 
frequencies  are  more  sensitive  to  larger  speeds  and  vice 
versa. 

Since  these  models  have  been  developed  recently,  there 
has  not  been  much  effort  to  apply  them  to  a  sequence  of 
two-dimensional  images.  An  effort  towards  this  problem 
can  be  found  in  the  current  work  of  Heeger  [18],  which 
involves  the  separation  of  the  two-dimensional  image  into 
a  set  of  one-dimensional  signals  by  using  filters  which  are 
tuned  to  specific  image-orientations.  However,  the  prob¬ 
lem  of  combining  the  information  from  the  units  tuned  to 
different  spatial-frequencies,  image-locations,  and  orienta¬ 
tions  has  not  yet  been  fully  addressed. 

7.2  Using  orientation-selective  filters 

The  inverse  relationship  between  the  spatial-frequency  and 
the  velocity  tuning  of  the  energy  measurement  unit  sug¬ 
gests  that  it  may  be  possible  to  modify  our  framework 
to  encompass  the  energy  models.  In  fact,  the  relation¬ 
ship  between  the  energy  models  and  the  gradient-based 
approach  for  velocity  computation  has  been  discussed  by 


Adelson and Bergen [1]. It is also possible to cast the SSD
minimization  process  as  a  measurement  of  spatio-temporal 
energy.  Based  on  these  relationships,  we  have  recently 
proposed  a  modified  hierarchical  framework  which  unifies 
the  energy  models,  the  gradient-based  approaches  and  the 
matching  approach.  In  this  section,  we  provide  an  outline 
of  this  framework. 

As  illustrated  in  Figure  8,  the  new  framework  consists 
of  a  set  of  “motion-detection”  units.  Each  unit  is  “tuned” 
to  (i)  a  specific  location  on  the  image-plane,  (ii)  a  spe¬ 
cific  range  of  spatial-frequencies  and  an  associated  range 
of  speeds,  (iii)  a  specific  range  of  image-orientations  and 
an  associated  range  of  directions  of  movement,  and  (iv) 
a  “sign”  of  motion  (left,  right,  or  stationary).  A  confi¬ 
dence  measure  is  included  with  the  output  of  every  unit. 
Depending  on  whether  the  input  is  a  continuous  image- 
stream  or  a  discrete  set  of  images,  the  measurement  of 
spatio-temporal  energy,  the  gradient-based  approach,  or  a 
matching  approach  can  be  used  for  computing  the  response 
of  each  individual  unit. 

The  output  of  each  unit  can  itself  be  regarded  as  a 
measurement  of  motion.  One  method  of  combining  these 
outputs  in  order  to  obtain  the  image  displacement  field 
may  involve  using  the  following  set  of  combination  prin¬ 
ciples:  (i)  spectral  continuity,  which  was  explained  be¬ 
fore,  (ii)  spatial  coherence,  which  is  a  general  form  of  a 
smoothness-constraint,  (iii)  orientation  consistency,  which 
means  that  the  units  tuned  to  different  orientations  all 
measure  the  same  underlying  movement,  and  (iv)  tempo¬ 
ral  coherence,  which  means  that  the  velocity  of  a  point  in 
the  environment  varies  smoothly  over  time. 


Figure  8:  The  orientation-selective  framework 




The  motivation  for  the  computation  of  image  flow  (i.e., 
dense  displacement  fields)  arises  out  of  the  numerous  anal¬ 
yses  that  relate  the  structure  of  the  image  flow  field  to  that 
of the environmental surfaces and the parameters of motion.
In more general terms, however, a dense displacement field
may not be necessary. Instead, we can consider methods
that  directly  use  the  outputs  of  the  energy-measurement 
units  for  visual  perception  and  image  understanding. 

The  outputs  of  the  localized  energy-measurement  units 
can  feed  directly  to  various  higher-level  processes  for  a 
diverse range of goals, e.g., navigation, spatial organization,
determination of 3-D structure (see Figure 8). In this
sense,  the  measurements  can  be  treated  as  “feature”  values 
with  an  associated  confidence  measure.  Equivalently,  the 
output  of  each  of  these  units  can  be  regarded  as  a  piece 
of  evidence  for  the  presence  of  an  intensity  structure  at 
a  specific  scale,  orientation,  and  velocity  with  an  associ¬ 
ated  uncertainty  measure.  There  are  a  number  of  schemes 
currently  under  investigation  for  the  combination  of  such 
probabilistic  “features”.  Most  of  these  approaches  appear 
suited  for  a  connectionist  model  of  computation. 

We  can  also  allow  the  measurement  processes  to  be 
actively  controlled  by  the  high-level  processes.  Although 
the  units  themselves  are  simple  and  perform  simple  convo¬ 
lutions  and  maximum  selection  operations,  by  controlling 
their  input  they  can  be  used  to  achieve  the  more  complex 
goals  of  the  higher-level  processes. 

As  an  example  of  the  use  of  the  energy  measurement 
units,  consider  the  spatial  organization  process.  First,  a 
rough  separation  of  rapidly  moving  areas  from  stationary 
areas  can  be  obtained  from  the  low-frequency  tuned  units. 
The  low-frequency  units  near  the  boundaries  of  motion 
will  have  low-confidence  information,  and  should  there¬ 
fore  be  ignored.  The  coarse-to-fine  strategy  can  be  ap¬ 
plied  selectively  to  the  units  in  the  moving  area  to  obtain 
high-frequency  measurements  of  movement,  while  simul¬ 
taneously  refining  the  segmentation.  As  the  refinement  of 
the  segmentation  progresses,  closer  attention  can  be  paid 
to the boundary units, whose search areas may be redefined
according to the motion of the area containing them.
As  much  as  possible,  the  process  of  spatial  organization 
(or  “grouping”)  should  progress  without  a  3-D  interpre¬ 
tation,  thereby  avoiding  the  various  numerical  instabili¬ 
ties  encountered  in  the  interpretation  process.  Once  the 
grouping  and  segmentation  is  finished,  the  high  resolution 
processes  can  provide  highly  accurate  measurements  of  mo¬ 
tion,  which  can  then  be  used  for  the  determination  of  the 
3-D  structure  and  the  3-D  motion  parameters. 

Any  scheme  for  the  use  of  these  energy  measurement 
units  should  also  include  an  understanding  of  how  these 
models  behave  under  a  tracking  situation,  i.e.,  when  the 
camera  (or  the  eye)  is  rotated  to  fixate  the  image  of  a 
particular  environmental  point  on  the  image-plane  (or  the 
retina).  Here,  it  is  not  sufficient  to  consider  the  output  of 
the  measurement-units  under  tracking.  A  comprehensive 
analysis  must  also  address  methods  of  using  the  measure¬ 


ments  to  trigger  the  tracking  mechanisms  and  schemes  for 
the  incremental  development  of  a  3-D  model  of  the  envi¬ 
ronment. 

Since  this  framework  is  new,  no  algorithm  completely 
consistent  with  it  exists  at  present,  although  Heeger’s  effort 
[18]  towards  using  the  energy  models  for  two-dimensional 
images  appears  to  be  a  step  in  that  direction.  In  our  own 
future  research,  we  intend  to  further  investigate  this  prob¬ 
lem  by  identifying  the  goals  of  the  organization  process, 
and  by  understanding  the  mathematical  relationships  of 
the  3-D  motion  parameters  and  the  geometry  of  the  envi¬ 
ronmental  surfaces  to  outputs  of  the  motion-detectors. 

8  Summary 

We  have  described  a  hierarchical  computational  framework 
for  the  determination  of  dense  displacement  fields  from  a 
pair of images. The framework is sufficiently general to
unify the gradient-based and the matching techniques.
In particular, we have shown that three successful
current techniques [6,15,12] are consistent with our framework.
We have also shown that in the limit the approximation
error used in the matching technique converges to the
intensity  constraint  used  in  the  gradient-based  techniques. 

We  have  also  proposed  a  novel  connectionist  framework 
for  the  measurement  of  motion  which  unifies  the  energy 
models,  the  gradient-based  approaches,  and  the  matching 
approach.  A  comprehensive  examination  of  alternate  ap¬ 
proaches  for  the  analysis  of  visual  information  which  use 
our  new  framework  is  a  significant  research  effort  in  it¬ 
self  and  can  form  the  basis  of  future  work  in  this  area  of 
Computer  Vision. 

ACKNOWLEDGEMENTS 

The  author  wishes  to  thank  Profs.  Edward  Riseman 
and  Allen  Hanson  for  their  continued  advice  and  support 
through  the  course  of  this  work.  Thanks  are  also  due  to 
Dr.  George  Reynolds,  Prof.  Riseman,  and  Michael  Doldt 
for  their  comments  on  earlier  drafts,  to  Dr.  Richard  Weiss 
for  his  help  with  some  of  the  mathematical  aspects  of  this 
research,  and  Dr.  Mark  Snyder  for  teaching  me  some  fun¬ 
damental  concepts  in  linear  algebra.  Finally,  thanks  are 
due  to  the  members  of  the  UMass  VISIONS  group  for  cre¬ 
ating  a  unique  and  valuable  research  environment. 

References 

[1]  Adelson E. H. and Bergen J. R., Spatiotemporal energy
models for the perception of motion, J. Opt.
Soc. Am. A, Vol. 2, No. 2, pp. 284-299, 1985.

[2]  Adiv G., Determining 3-d motion and structure from
optical flows generated by several moving objects,
IEEE T-PAMI, 7 (4), pp. 384-401, 1985.




[3]  Adiv  G.,  Inherent  ambiguities  in  recovering  3-d  mo¬ 
tion  and  structure  from  a  noisy  flow  field,  Proc. 
CVPR,  pp.  70-77,  1985. 

[4]  Aggarwal  J.  K.,  Structure  and  motion  from  images: 
fact  and  fiction,  Proc.  of  the  third  Workshop  on  Com¬ 
puter  Vision:  Representation  and  Control,  pp.  127- 
128,  1985. 

[5]  Anandan P., Computing dense displacement fields
with confidence measures in scenes containing occlusion,
SPIE Intelligent Robots and Computer Vision
Conference, Vol. 521, pp. 184-194, 1984, also COINS
Technical Report 84-32, University of Massachusetts,
December 1984.

[6]  Anandan  P.  and  Weiss  R.,  Introducing  a  smoothness 
constraint  in  a  matching  approach  for  the  compu¬ 
tation  of  displacement  fields,  DARPA  IU  Workshop 
Proc.,  pp.  186-196,  1985. 

[7]  Anandan  P.,  Measuring  visual  motion  from  image 
sequences,  Ph.D.  dissertation,  COINS  Department, 
University  of  Massachusetts,  Amherst,  Mass.,  in 
preparation. 

[8]  Burt  P.  J.,  Yen  C.  and  Xu  X.,  Local  correlation 
measures  for  motion  analysis:  A  comparative  study, 
IEEE  Proc.  PRIP,  269-274,  1982. 

[9]  Burt  P.  J.,  Fast  filter  transforms  for  image  process¬ 
ing,  Computer  graphics  and  image  processing ,  16, 
pp.  20-51,  1981. 

[10]  Burt  P.  J.,  Yen  C.  and  Xu  X.,  Multi-resolution  flow¬ 
through  motion  analysis,  IEEE  CVPR  Conference 
Proceedings,  June  1983,  pp.  246-252. 

[11]  Crowley  J.  L.,  and  Stern  R.  M.,  Fast  computations 
of  the  difference  of  low-pass  transform,  IEEE  Trans¬ 
actions  on  PAMl,  vol.  PAMI-6,  pp.  212-222,  1984. 

[12]  Enkelmann,  W.,  Investigations  of  multigrid  algo¬ 
rithms  for  the  estimation  of  optical  flow  fields  in 
image  sequences,  Proc.  of  the  Workshop  on  Motion: 
Representation  and  Control,  S.  Carolina,  pp.  81-87, 
1986. 

[13]  Fang  J.  and  Huang  T.  S.,  Some  experiments  on  es¬ 
timating  the  3-d  motion  parameters  of  a  rigid  body 
from  two  consecutive  Image  Frames,  IEEE  Transac¬ 
tions  on  Pattern  Analysis  and  Machine  Intelligence, 
vol.  PAMI-6,  NO.  5,  pp  545-554,  September,  1984. 

[14]  Glazer  F.,  Reynolds  G.  and  Anandan  P.,  Scene 
matching  by  hierarchical  correlation,  IEEE  CVPR 
conference,  June  1983,  pp.  432-441. 


[15]  Glazer  F.,  Hierarchical  motion  detection,  Ph.D.  dis¬ 
sertation,  COINS  Department,  University  of  Mas¬ 
sachusetts,  Amherst,  Ma.,  February  1987. 

[16]  Grimson  W.  E.  L.,  Computational  experiments  with 
a  feature  based  stereo  algorithm,  IEEE  transac¬ 
tions  on  Pattern  Analysis  and  Machine  Intelligence , 
Vol.  PAMI-7,  No.  1,  pp.  17-34,  1985. 

[17]  Hanson A. R. and Riseman E. M., Processing
cones: A computational structure for image analysis,
in: Structured computer vision, Tanimoto S. and
Klinger A. (Eds.), Academic Press, New York, 1980.

[18]  Heeger D. and Pentland A., Seeing structure
through chaos, Proc. of the Workshop on Motion:
Representation and Control, S. Carolina, pp. 131-136,
1986.

[19]  Hildreth  E.  C.,  The  measurement  of  visual  motion, 
Ph.D.  dissertation,  Dept,  of  Electrical  Engineering 
and  Computer  Science,  MIT,  Cambridge,  Ma.,  1983. 

[20]  Horn  B.  K.  P.,  and  Schunck  B.  G.,  Determining  Opti¬ 
cal  Flow,  Artificial  Intelligence,  vol.  17,  pp.  185-203. 

[21]  Klinger A., and Dyer R. D., Experiments on picture
representations using regular decomposition, Computer
graphics and image processing, vol. 5, no. 1,
pp. 68-105, 1976.

[22]  Lucas  B.  D.,  and  Kanade  T.,  An  iterative  image  reg¬ 
istration  technique  with  an  application  to  stereo  vi¬ 
sion,  Proc.  7th  IJCAI,  Vancouver,  B.  C.,  Canada, 
pp.  674-679,  1981. 

[23]  Marr  D.  and  Poggio  T.,  A  computational  theory  of 
human  stereo  vision,  Proc.  Roy.  Soc.  London,  Ser. 
B,  204,  pp  301-308,  1979. 

[24]  Mayhew  J.  E.  W.  and  Frisby  J.  P.,  Psychophysi¬ 
cal  and  computational  studies  towards  a  theory  of 
human  stereopsis,  Artificial  Intelligence,  Vol.  17, 
pp.  349-385,  1981. 

[25]  Miller,  R.  and  Q.  F.  Stout,  Geometric  algorithms 
for  digitized  pictures  on  a  mesh-connected  computer. 
IEEE  T-PAMI,  vol.  PAMI-7,  pp.  216-228,  1985. 

[26]  Moravec  H.  Robot  rover  visual  navigation,  UMI  Re¬ 
search  press,  Ann  Arbor,  Michigan,  1981. 

[27]  Nagel  H.  H.,  Constraints  for  the  estimation  of  dis¬ 
placement  vector  fields  from  image  sequences,  Proc. 
IJCAI,  Karlsruhe  /  FRG,  pp.  945-951,  1983. 

[28]  Nagel  H.  H.,  Displacement  vectors  derived  from  sec¬ 
ond  order  intensity  variations  in  image  sequences, 
CVGIP,  vol.  21,  pp.  85-117,  1983. 




[29]  Nagel H. H. and Enkelmann W., An investigation
of smoothness constraints for the estimation of displacement
vector fields from image sequences, IEEE
transactions on PAMI, vol. PAMI-8, pp. 565-593,
1986.

[30]  Nagel  H.  H.,  Image  sequences  -  Ten  (Octal)  years  - 
from  phenomenology  towards  a  theoretical  founda¬ 
tion,  Proc.  of  Eighth  ICPR,  Paris,  France,  1986. 

[31]  Rieger J. H. and Lawton D. T., Determining the instantaneous
axis of translation from optic flow generated
by arbitrary sensor motion, Proc. ACM
Siggraph/Sigart Interdisciplinary Workshop on Motion,
Toronto, pp. 33-41, 1983.

[32]  Tanimoto S. L. and Pavlidis T., A hierarchical data
structure for picture processing, CGIP, vol. 4, no. 2,
pp. 104-119, 1975.

[33]  Terzopoulos D., Multiresolution Computation of
Visible-Surface Representations, Ph.D. dissertation,
Massachusetts Institute of Technology, Jan. 1984.

[34]  Terzopoulos D., Image analysis using multiresolution
methods, IEEE transactions on Pattern
Analysis and Machine Intelligence, Vol. PAMI-8,
No. 2, pp. 129-139, 1986.

[35]  van Santen J. P. H. and Sperling G., Elaborated Reichardt
detectors, J. of Opt. Soc. Am., vol. 2, no. 7,
pp. 300-321, 1985.

[36]  Watson A. B. and Ahumada A. J., Model of human
visual-motion sensing, J. of Opt. Soc. Am., vol. 2,
no. 7, 1985.

[37]  Waxman  A.,  An  image-flow  paradigm,  Proc.  Work¬ 
shop  on  computer  vision,  Annapolis,  MD,  pp.  49-57, 
1984. 

[38]  Wong  R.  Y.  and  Hall  E.  L.,  Sequential  hierarchical 
scene  matching,  IEEE  transactions  on  Computers, 
Vol.  27,  No.  4,  pp.  359-366,  1978. 




Hierarchical  Gradient-Based  Motion  Detection 

Frank  Glazer 


Computer  and  Information  Science  Department 
University  of  Massachusetts 
Amherst,  Massachusetts  01003 


Abstract 

A  major  class  of  techniques  for  the  computation  of  im¬ 
age  motion  is  that  of  gradient-based  methods.  The  applica¬ 
tion  of  gradient-based  methods  is  limited  to  cases  of  small 
motions  (a  few  pixels). 

In  this  report  we  demonstrate  that  gradient-based  meth¬ 
ods  may  be  extended  to  the  computation  of  large  dispari¬ 
ties  by  formulating  a  hierarchical  generalization  of  the  single 
level  method  that  operates  on  low-pass  image  pyramids.  A 
thorough  formal  analysis  of  the  gradient-based  updating 
equations  is  presented  as  well  as  the  implementation  of  the 
method  in  a  hierarchical  processing  cone  algorithm.  Exper¬ 
iments  are  presented  both  to  show  how  single  level  methods 
fail  and  how  the  hierarchical  method  succeeds. 


1  Introduction 

Motion  analysis  can  be  used  in  machine  vision  sys¬ 
tems  to  determine  the  location,  movement,  and  iden¬ 
tity  of  objects.  The  first  stage  in  such  analysis  is  the 
computation  of  image  motion  —  the  motion  of  compo¬ 
nents  of  a  dynamic  image  sequence.  One  major  class 
of  techniques  for  the  computation  of  image  motion  is 
that of gradient-based methods [Fennema & Thompson 79,
Glazer 81, Horn & Schunck 81]. These methods are characterized
by (1) the approximate representation of local image
neighborhoods by low order polynomials; and (2) the
derivation  of  constraints  on  the  possible  image  motions  in 
terms  of  gradients  of  the  image  function,  both  spatial  and 
temporal.  Image  motions  are  determined  by  combining 
constraints  from  multiple  image  locations. 

This  report  describes  a  hierarchical  approach  to 
gradient-based  motion  computation.  It  generalizes 
gradient-based  techniques  to  handle  cases  for  which  the 


disparities  between  points  in  two  images  are  more  than 
a  few  pixels  in  length.  The  hierarchical  control  is  very 
similar to that used for hierarchical correlation matching
[Glazer et al. 83, Anandan 84, Glazer 87]. Coarse estimates
of disparity at high levels of the processing cone
are  used  at  lower  levels  to  guide  gradient-based  algorithms 
which  compute  increasingly  accurate  (higher  resolution) 
disparity  vectors. 

This  report  describes  one  part  of  the  author’s  thesis 
on hierarchical motion detection [Glazer 87]. The other
major  topics  addressed  there  are  (1)  fast  construction  of 
image  pyramids;  (2)  hierarchical  correlation-based  motion 
detection;  and  (3)  multilevel  relaxation  for  optic  flow  com¬ 
putation.  In  particular,  regarding  the  first  topic,  a  family 
of discrete Gaussian low-pass filters for building low-pass
pyramids  is  presented  that  provides  good  anti-aliasing  char¬ 
acteristics,  efficient  computation,  and  a  good  hierarchical 
Gaussian  approximation  (in  the  sense  of  Burt’s  hierarchi¬ 
cal  filters  [Burt  81]).  These  filters  are  used  to  build  the 
low-pass  pyramids  used  in  the  experiments  presented  in 
this report. All of the motion computation methods presented
in [Glazer 87], including hierarchical gradient-based
algorithms, are designed for a hierarchical processing architecture
consisting of the processing cone, a general computational
architecture which is regular, parallel and hierarchical
[Hanson & Riseman 74, Hanson & Riseman 80];
and image pyramids, multiresolution image representations
[Tanimoto & Pavlidis 75, Burt 82].

The  success  of  the  gradient-based  methods  depends  on 
the  assumption  that  the  image  value  (intensity)  changes 
in  local  neighborhoods  can  be  adequately  approximated  by 
the  low  order  terms  of  their  Taylor  series  expansions.  In 
particular,  first  order  methods  depend  on  the  degree  to 
which image structure is locally linear. This assumption
fails when (1) the optic flow (disparity) is relatively
large, or (2) the extent over which the intensity varies


Research  supported  in  part  by:  DARPA  grant  N00014-82-K-0464 

Author’s  current  address:  Digital  Equipment  Corporation,  77  Reed  Rd.  (HL02-3/M8),  Hudson,  MA  01749 




linearly  is  small.  In  either  case,  experiments  presented  later 
in this report show that a gradient-based algorithm operating
at a single level of resolution will fail to detect correct
disparities.  Figures  8  and  9  show  a  progression  of  good  to 
bad  results  for  a  sequence  of  image  pairs  with  increasing 
disparities. 

A  little  more  formally,  consider  the  following  first  order 
approximation  used  in  first  order  gradient-based  methods 

$$F(x + U,\, y + V) = F(x,y) + \nabla F \cdot (U,V) \qquad (1)$$

Let $r$ be the radius about $(x,y)$ within which this approximation
holds (with some error tolerance $\epsilon$). To use this
equation in the computation of a disparity vector $(U,V)$,
we need $\|(U,V)\| < r$. This condition may be violated in
two ways: (1) $\|(U,V)\|$ is too large, i.e., large disparities; or
(2) $r$ is too small, i.e., strong second order (or higher) image
structure.
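Both failure modes can be seen numerically in a 1-D analogue of Equation 1. The sketch below is only an illustration under stated assumptions: the test signal $\sin(\omega x)$, the sample point `x`, and the function name are hypothetical and are not taken from the experiments in this report. A larger displacement `U` corresponds to case (1); a larger frequency `omega` shrinks the locally linear neighborhood and corresponds to case (2).

```python
import numpy as np

def linearization_error(omega, U, x=0.3):
    """Error of the first order prediction F(x) + F'(x) U for F(x) = sin(omega x)."""
    exact = np.sin(omega * (x + U))
    first_order = np.sin(omega * x) + omega * np.cos(omega * x) * U
    return abs(exact - first_order)

# Case (1): same signal, increasing displacement.
small_shift = linearization_error(omega=1.0, U=0.1)
large_shift = linearization_error(omega=1.0, U=2.0)

# Case (2): same displacement, faster intensity variation.
smooth_signal = linearization_error(omega=1.0, U=0.5)
busy_signal = linearization_error(omega=4.0, U=0.5)
```

In both comparisons the second error is much larger than the first, which is the behavior the condition $\|(U,V)\| < r$ is meant to exclude.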

The  first  problem  can  be  overcome  if  a  coarse  estimate 
of  disparity  is  known  and  only  an  update  need  be  com¬ 
puted.  If  the  coarse  disparity  estimate  is  a  good  one  then 
the required update vector is small and will lie in the “linear”
neighborhood. This suggests a coarse-to-fine approach
to  disparity  computation.  The  second  problem  is  avoided 
by  eliminating  higher-order  image  structure.  Such  elimina¬ 
tion  of  fast  variations  in  the  image  can  be  done  by  low  pass 
filtering.  Of  course  this  smoothing  will  eliminate  detail  in 
the  frames  and  lower  the  resolution  of  the  computed  dispar¬ 
ity  field.  The  accuracy  of  disparity  values  is  reduced  and 
fast  variations  are  lost.  In  a  sense  the  disparity  has  itself 
been  smoothed.  This  suggests  that  disparity  computations 
on  smoothed  data  can  be  performed  on  subsampled  image 
data. 

The  combination  of  coarse-to-fine  stages  with  subsam¬ 
pling  at  the  coarser  stages  leads  us  to  conclude  that  first  or¬ 
der  gradient-based  methods  may  be  extended  to  non-trivial 
disparity  computations  by  formulating  a  hierarchical  gen¬ 
eralization  of  the  single  level  method  that  operates  on  low- 
pass  image  pyramids.  “First  order”  gradient-based  meth¬ 
ods  are  those  that  only  involve  first  order  partial  derivatives 
of  the  image.  (By  “non-trivial”  we  mean  disparities  more 
than  just  a  few  pixels  in  length.) 
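A low-pass pyramid of the kind referred to here can be sketched as repeated smoothing followed by subsampling. The 5-tap binomial kernel below is a common stand-in and is an assumption of this sketch; it is not the family of discrete Gaussian filters developed in [Glazer 87]. The function names, the edge-replicating border treatment, and the fine-to-coarse ordering of the returned list are likewise illustrative choices.

```python
import numpy as np

# Assumed smoothing kernel (binomial approximation to a Gaussian).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def smooth(image):
    """Separable low-pass filter with edge-replicated borders."""
    height, width = image.shape
    padded = np.pad(image, 2, mode="edge")
    # Filter along columns of each row, then along rows.
    rows = sum(w * padded[:, k:k + width] for k, w in enumerate(KERNEL))
    return sum(w * rows[k:k + height, :] for k, w in enumerate(KERNEL))

def lowpass_pyramid(image, levels):
    """Return [finest, ..., coarsest]; each level halves the resolution."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        # Smoothing before subsampling limits aliasing.
        pyramid.append(smooth(pyramid[-1])[::2, ::2])
    return pyramid
```

Because each level halves the pixel spacing, a disparity of d pixels at the finest level appears as roughly d / 2 pixels one level up, which is why sufficiently coarse levels bring large disparities within reach of a first order method.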

In  the  hierarchical  method,  approximate  disparities  are 
refined  at  each  level  of  the  processing  cone  in  a  coarse-to- 
fine  sequence.  At  each  level,  the  update  vectors  are  ex¬ 
pected  to  be  of  bounded  magnitude,  relative  to  the  image 
scale  (pixel  spacing)  at  that  level.  This  constraint  provides 
a consistency check which can be used to “filter” out bad
update  vectors  that  might  result  from  noisy  image  data  or 
violations  of  the  method’s  assumptions. 

The  next  section  summarizes  the  hierarchical  method  of 
applying gradient-based disparity computation. Section 3
presents  the  technical  details.  Then,  in  Section  4,  a  hier¬ 
archical  gradient-based  algorithm  is  briefly  described.  Ex¬ 
periments  in  Section  5  show  how  single  level  methods  fail 
and  how  the  hierarchical  method  succeeds. 


2  The  Hierarchical  Method 

Initially we are given two input images $F_1(i,j)$ and $F_2(i,j)$
between which the disparity must be computed.
The disparity will be represented as a vector field
$(s,t) = (s(i,j), t(i,j))$ such that a given disparity vector
$(s(i,j), t(i,j))$ describes the displacement between a point
$(i,j)$ in $F_1$ and its “match” at $(i + s, j + t)$ in $F_2$. Low-pass
image pyramids $\mathcal{I}_1$ and $\mathcal{I}_2$ are formed for each of the
two frames [Glazer 87, Chapter 2]. Processing begins at a
level $l_{top}$, high enough in the pyramid so that the maximum
disparity is less than 1 pixel in magnitude. We assume that
the coarse disparity field at level $l_{top} - 1$ is zero, that is,
(0,0) at all pixels.

The following processing is performed in a coarse-to-fine
succession of stages, beginning at level $l_{top}$ and proceeding
down the pyramid (increasing level numbers) to the finest
level. First, the coarse disparity field at level $k - 1$ is projected
down to level $k$, giving an approximate disparity
field at that level. Then, using the two frames $f_1$ and $f_2$
at level $k$ of the image pyramids $\mathcal{I}_1$ and $\mathcal{I}_2$ and the
approximate disparity field $(U,V) = (U(i,j), V(i,j))$, an
update vector field $(u,v) = (u(i,j), v(i,j))$ is computed.
The update vector field is added to the approximate disparity
field to give the more accurate updated disparity
field $(U + u, V + v)$. The update vector field is computed
at each point $(i,j)$ by comparing data in the neighborhood
of $f_1(i,j)$ to data in the corresponding neighborhood
of $f_2(i + U, j + V)$. The update vector $(u,v)$ is the relative
translation between these two neighborhoods. It can be
computed using a gradient-based method. The relationship
between these vectors at a given pixel is shown in Figure 1.
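The control structure described above can be sketched as follows. This is a minimal illustration under several assumptions: the pyramids are Python lists ordered from coarse to fine, resolution doubles between levels, projection replicates each coarse vector onto its children and doubles its length to account for the halved pixel spacing, and the update step is supplied by the caller as a hypothetical `compute_update` function. None of these names comes from the report.

```python
import numpy as np

def project_down(U, V, shape):
    """Project a coarse disparity field onto a finer grid of the given shape."""
    rows = np.minimum(np.arange(shape[0]) // 2, U.shape[0] - 1)
    cols = np.minimum(np.arange(shape[1]) // 2, U.shape[1] - 1)
    # Assumption: vectors are doubled because pixel spacing is halved.
    return 2.0 * U[np.ix_(rows, cols)], 2.0 * V[np.ix_(rows, cols)]

def coarse_to_fine(pyramid1, pyramid2, compute_update):
    """Refine a disparity field level by level.

    pyramid1, pyramid2: lists of 2-D arrays ordered coarse to fine.
    compute_update(f1, f2, U, V) -> (u, v): any single-level method,
    for example a gradient-based one.
    """
    U = V = None
    for f1, f2 in zip(pyramid1, pyramid2):
        if U is None:
            # The field above the top level is taken to be (0,0) everywhere.
            U, V = np.zeros(f1.shape), np.zeros(f1.shape)
        else:
            U, V = project_down(U, V, f1.shape)
        u, v = compute_update(f1, f2, U, V)
        U, V = U + u, V + v
    return U, V
```

Keeping the single-level update behind a function argument reflects the point made in the text that the hierarchical control is independent of the particular method used to compute the update vectors.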

The gradient-based algorithm uses two stages. In the
first stage the edge flow is computed at each pixel. It is
the component of the update vector parallel to the mean
spatial gradient at $f_1(i,j)$ and $f_2(i + U, j + V)$. The edge
flow vector, along with the approximate disparity, defines
a constraint line which is a locus of points upon which the
update vector lies. In Figure 1 the edge flow vector is shown
as $(u_0, v_0)$.

Various methods of computing optic flow from edge
flow are reviewed in [Glazer 87, Chapter 3]. These include:
global histogramming using a modified Hough transform
[Fennema & Thompson 79]; computing the “pseudo-intersection”
(a least square error fit) of a set of constraint
lines [Glazer 81]; a global optimization method that computes
a smooth motion field which satisfies the gradient-based
constraints at each pixel [Horn & Schunck 81,
Yachida 81, Cornelius & Kanade 83, Nagel 83]; and constraint
line clustering [Schunck 83]. We will use an optimization
method. The next section presents a general development
of such a method for use in a hierarchical scheme
that utilizes approximate disparity information.




Figure 1: Geometry of updating process

The refinement of the disparity estimate at a given point
in the first frame involves the following vectors: $(U,V)$
is the approximate disparity vector; $(u,v)$ is an update
vector; and $(U + u, V + v)$ is the updated disparity. The
edge flow $(u_0, v_0)$ is the component of $(u,v)$ parallel to
the mean spatial gradient of the images at the points
being compared. It defines a constraint line which is the
locus of points upon which $(u,v)$ is found.
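The edge flow can be computed in closed form once a constraint of the form $(u,v) \cdot \bar{G} = -D$ is available, where $\bar{G}$ is the mean spatial gradient and $D$ is the image difference between the two points being compared. The sketch below assumes such a constraint, which corresponds to requiring that the image value be unchanged at the matched point; that requirement, the function name, and the guard `eps` for vanishing gradients are assumptions of this illustration.

```python
import numpy as np

def edge_flow(D, gx, gy, eps=1e-9):
    """Point on the constraint line (u, v) . (gx, gy) = -D nearest the origin.

    D: difference f2(i + U, j + V) - f1(i, j).
    (gx, gy): mean spatial gradient at the two points being compared.
    """
    D, gx, gy = (np.asarray(a, dtype=float) for a in (D, gx, gy))
    mag2 = gx * gx + gy * gy
    # Where the gradient vanishes there is no constraint; return zero flow.
    scale = np.where(mag2 > eps, -D / np.maximum(mag2, eps), 0.0)
    return scale * gx, scale * gy
```

Any update vector consistent with the constraint differs from $(u_0, v_0)$ only by a component perpendicular to the gradient, which is why a second stage combining several constraint lines is needed.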


3  Formal  Development 

3.1  Computing  a  Refined  Disparity  Field 

Let $F_1(x,y)$ and $F_2(x,y)$ be two frames of an image sequence
and let $(U,V) = (U(x,y), V(x,y))$ be a vector field
approximately describing the disparity from $F_1$ to $F_2$. We
want to find a vector field $(u,v) = (u(x,y), v(x,y))$ which
“updates” $(U,V)$, i.e. the new vector field $(U + u, V + v)$ is a
better approximation of the disparity from $F_1$ to $F_2$. In the
following development we will use the notation $\mathbf{x} = (x,y)$,
$\mathbf{U} = (U,V)$ and $\mathbf{u} = (u,v)$.

3.1.1  The  Constraint  Line 

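First we will find a constraint line for $\mathbf{u}$. We define the
change in image value between a point $\mathbf{x}$ in $F_1$ and a
corresponding point in $F_2$ at the displaced point
$\mathbf{x}+\mathbf{U}+\mathbf{u}$ as:

$$\Delta F(\mathbf{x}) = F_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) - F_1(\mathbf{x}) \tag{2}$$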

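Consider the following two ways in which a Taylor series
expansion of $F_2$ can be computed:

$$F_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) = F_2(\mathbf{x}+\mathbf{U}) + \mathbf{u}\cdot G_2(\mathbf{x}+\mathbf{U}) + \tfrac{1}{2}\mathbf{u}^T H_2(\mathbf{x}+\mathbf{U})\,\mathbf{u} + \text{higher order terms} \tag{3}$$

and

$$F_2(\mathbf{x}+\mathbf{U}) = F_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) + (-\mathbf{u})\cdot G_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) + \tfrac{1}{2}(-\mathbf{u})^T H_2(\mathbf{x}+\mathbf{U}+\mathbf{u})\,(-\mathbf{u}) + \text{higher order terms} \tag{4}$$

where $G_2$ is the gradient of $F_2$ and $H_2$ is the Hessian of
$F_2$. The gradient and the Hessian of a two dimensional
image are generalizations of the first and second derivative
of a function of one variable. The last two equations can be
combined to give:

$$F_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) = F_2(\mathbf{x}+\mathbf{U}) + \mathbf{u}\cdot\tfrac{1}{2}\left[G_2(\mathbf{x}+\mathbf{U}) + G_2(\mathbf{x}+\mathbf{U}+\mathbf{u})\right] + \tfrac{1}{4}\mathbf{u}^T\left[H_2(\mathbf{x}+\mathbf{U}) - H_2(\mathbf{x}+\mathbf{U}+\mathbf{u})\right]\mathbf{u} + \text{third and higher order terms} \tag{5}$$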

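If $\mathbf{U}+\mathbf{u}$ varies slowly, i.e. $|\nabla(\mathbf{U}+\mathbf{u})|^2$ is small, then the
following approximations can be made:
$G_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) \approx G_1(\mathbf{x})$ and
$H_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) \approx H_1(\mathbf{x})$ ([Glazer 87, Appendix D]).
We then get:

$$F_2(\mathbf{x}+\mathbf{U}+\mathbf{u}) = F_2(\mathbf{x}+\mathbf{U}) + \mathbf{u}\cdot\tfrac{1}{2}\left[G_2(\mathbf{x}+\mathbf{U}) + G_1(\mathbf{x})\right] + \tfrac{1}{4}\mathbf{u}^T\left[H_2(\mathbf{x}+\mathbf{U}) - H_1(\mathbf{x})\right]\mathbf{u} + \text{third and higher order terms} \tag{6}$$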

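Substituting into Equation 2 and ignoring higher order
terms finally gives:

$$\Delta F(\mathbf{x}) = F_2(\mathbf{x}+\mathbf{U}) - F_1(\mathbf{x}) + \mathbf{u}\cdot\tfrac{1}{2}\left[G_2(\mathbf{x}+\mathbf{U}) + G_1(\mathbf{x})\right] \tag{7}$$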

If $\mathbf{U}$ is a good estimate, then $\mathbf{u}$ will be small and hence
dropping higher order terms leaves us with a good approximation.
In fact the second order term is very small due to the
fact that $H_2(\mathbf{x}+\mathbf{U})$ is very close to
$H_1(\mathbf{x}) \approx H_2(\mathbf{x}+\mathbf{U}+\mathbf{u})$.
That is why we used both of the Taylor series expansions
in Equations 3 and 4 and not just one of them.
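The benefit of averaging the two gradients can be seen in a one-dimensional sketch. The quadratic profile `g`, the displacement `d`, and the sample point `x` below are illustrative assumptions, not data from the paper; the approximate disparity is taken as zero. Setting the change in image value of Equation 7 to zero and solving for the update recovers the displacement exactly for this profile, whereas a single expansion about the second frame does not.

```python
# One-dimensional illustration with a hypothetical quadratic image profile.
def g(x):
    return x * x            # image profile of frame 1

def dg(x):
    return 2.0 * x          # derivative of the profile

d = 0.75                    # true displacement: F2(x) = F1(x - d)
x = 3.0                     # point at which the update is estimated

F1, F2 = g(x), g(x - d)     # image values in the two frames (U = 0)
G1, G2 = dg(x), dg(x - d)   # gradients of the two frames at x

# Solve Delta F = 0 for u using the averaged gradient (Equation 7, U = 0).
u_avg = -(F2 - F1) / (0.5 * (G1 + G2))

# Same solve using only the gradient of F2 (a single Taylor expansion).
u_one = -(F2 - F1) / G2

print(u_avg, u_one)
```

For a quadratic profile the Hessians of the two frames are identical, so the second order term of Equation 6 vanishes and `u_avg` equals `d`; `u_one` carries the uncancelled second order error.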

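To simplify our equations we define

$$\hat{F}_2(\mathbf{x}) = F_2(\mathbf{x}+\mathbf{U}) = F_2(\mathbf{x}+\mathbf{U}(\mathbf{x})) \tag{8}$$

and

$$F = \tfrac{1}{2}(F_1 + \hat{F}_2) \tag{9}$$

Then the gradient of $F$ is equal to
$\nabla F = (F_x, F_y) = \tfrac{1}{2}(G_1 + \hat{G}_2)$,
where $G_1$ is the gradient of $F_1$ and $\hat{G}_2$ is the gradient
of $\hat{F}_2$. We can then write

$$\Delta F(\mathbf{x}) = \hat{F}_2(\mathbf{x}) - F_1(\mathbf{x}) + \mathbf{u}\cdot(F_x, F_y) \tag{10}$$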




If we set $\Delta F$ to zero then Equation 10 defines a constraint
line upon which the update vector $\mathbf{u}$ must lie.

Two key assumptions in the above derivation were that
(1) $\mathbf{U}(\mathbf{x}) + \mathbf{u}(\mathbf{x})$ varies slowly, and (2) second and higher
order terms in the Taylor series expansion can be ignored.
The first assumption does not hold at motion boundaries.
This limitation is the same as that which occurs when
gradient-based flow computations are performed. Hence,
similar solutions, whatever they might be, can be used.

The (first order) Taylor series approximation only holds
within some local neighborhood. If the disparity is greater
than the “radius” of this neighborhood then the method
doesn’t work. The second assumption is warranted because
the approximate disparity field $\mathbf{U}(\mathbf{x})$ allows us to combine
higher order components from approximately equivalent
image neighborhoods. This is a key element of the
hierarchical method. Imagine the above derivation was
performed for the non-hierarchical case. This is easily
done by setting $\mathbf{U} = (0,0)$, that is, eliminate all
appearances of $\mathbf{U}$ in Equations 2 to 10. The second order
terms in Equations 5 and 6 then contain the factors
$[H_2(\mathbf{x}) - H_2(\mathbf{x}+\mathbf{u})] \approx [H_2(\mathbf{x}) - H_1(\mathbf{x})]$. These terms will
NOT be negligible when the actual disparity is large.

Theoretically, the second assumption does not pose a
problem for the case of flow computation which is formulated
in a continuous three dimensional XYT space. However,
in real imaging systems we are actually processing
discrete frames taken with a non-trivial $\Delta t$ and
approximating derivatives with finite differences. This introduces
a non-infinitesimal jump which may exceed the radius of
approximation.


3.1.2  Edge  flow 


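We define edge flow as the point on the constraint line
lying nearest the origin. The edge flow is then

$$\mathbf{u}_0 = (u_0, v_0) = \frac{-(\hat{F}_2 - F_1)}{|\nabla F|}\,\frac{\nabla F}{|\nabla F|} = \left(\frac{-F_x(\hat{F}_2 - F_1)}{F_x^2 + F_y^2},\ \frac{-F_y(\hat{F}_2 - F_1)}{F_x^2 + F_y^2}\right) \tag{11}$$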


This  vector  is  used  as  the  initial  estimate  of  the  update 
vector  u. 
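A minimal sketch of the edge flow computation, assuming the spatial gradient components and the image difference at a pixel are already available as plain numbers (the function name `edge_flow` and its arguments are illustrative, not from the paper):

```python
def edge_flow(fx, fy, diff):
    """Return the point on the line fx*u + fy*v + diff = 0 nearest the origin.

    fx, fy : components of the mean spatial gradient at the pixel
    diff   : image difference between the displaced second frame and the first
    """
    g2 = fx * fx + fy * fy
    if g2 == 0.0:
        # With a zero gradient there is no constraint line to project onto.
        raise ValueError("zero spatial gradient: no constraint line")
    return (-fx * diff / g2, -fy * diff / g2)

u0, v0 = edge_flow(3.0, 4.0, -5.0)
print(u0, v0)
```

The returned vector is parallel to the gradient and satisfies the constraint line equation, which is what makes it the nearest point to the origin.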

Both $\mathbf{u}$ and $\mathbf{u}_0$ lie on the constraint line and $\mathbf{u}_0$ is that
point on the line nearest the origin. Thus $\mathbf{u}_0$ can be no
larger (in magnitude) than $\mathbf{u}$, that is $\|\mathbf{u}_0\| \le \|\mathbf{u}\|$. In a
hierarchical scheme, the approximate disparity vectors $\mathbf{U}$ ensure
that the update vector is bounded in magnitude.
This in turn places a bound on the magnitude of the edge
flow vector. In Section 4.4 this fact is used to introduce a
consistency check on the computed edge flow. This is an
added benefit of the hierarchical method.


3.1.3  Computing  the  Update:  An  Optimization 
Problem 

Equation 10 provides a constraint on the value of the update
vector at each point in the image space (at a given
level). The assumption that the change in image value at
the matching points vanishes, i.e. $\Delta F = 0$, specifies a
line in “update vector” space upon which the update vector
should lie. Due to noise in either image, this is likely
to be too strong a constraint. A more practical constraint
requires that the update vector lie as close to the line as
possible.

In the second stage of a first order method, other information
is brought to bear to finally arrive at update vectors
at each pixel. In the motion problem we have posed, disparity
vector fields can vary across the image, but locally
they are almost constant. Thus we must use local disparity
information to arrive at a final update value at a given
pixel. We will use an optimization method which, while
formulated as a global optimization problem, can be represented
as a set of local constraints and can be solved by a
local, parallel, uniform algorithm. Horn and Schunck first
applied such a method to the computation of optic flow
from edge flow [Horn & Schunck 81]. The analysis here is
similar, using image differences in place of partial derivatives
with respect to time and replacing the optic flow field
by an approximate disparity field summed with an update
vector field.

The variational problem we wish to solve asks for a disparity
update field for which (1) the update vector lies as
close to the constraint line as possible, and (2) the disparity
field is as smooth as possible. Since “close”-ness and
“smooth”-ness are measured in different units, a constant
of proportionality ($\alpha$ in the equation below) must be included
as a parameter to relate these two values.

The problem then posed is: given an approximate disparity
field $(U,V)$, and image gradients $F_x$, $F_y$, $\hat{F}_2 - F_1$, find
an update vector field $(u,v)$ which minimizes the functional

$$\iint \left[F_x u + F_y v + (\hat{F}_2 - F_1)\right]^2 + \alpha^2\left[|\nabla(U+u)|^2 + |\nabla(V+v)|^2\right]\,dx\,dy \tag{12}$$

The first term in the functional is the square of the change
of image value, as it is defined by Equation 10. The second
term is a roughness measure of $U+u$ and $V+v$, where $\nabla$
is the gradient operator and $\alpha$ a relative weighting factor.
Note that it is the disparity field $(U+u, V+v)$ which is
required to be smooth and not just the update field $(u,v)$.
This is very important in the implementation algorithm
(Section 4), which truncates the disparity vector as it is
passed down a level in the cone.




The equivalent PDE system (Euler’s equations,
[Courant & Hilbert 53, Section IV.3.4]) is

$$\alpha^2\Delta(U+u) - F_x^2 u - F_x F_y v = F_x(\hat{F}_2 - F_1) \tag{13a}$$

$$\alpha^2\Delta(V+v) - F_x F_y u - F_y^2 v = F_y(\hat{F}_2 - F_1) \tag{13b}$$

These equations involve first and second partial derivatives
in the image data and the vector fields. This is purely local
information, that is, at a given point the partial derivatives
of a scalar or vector field are determined by their values in
the immediate neighborhood of the point. Thus the discrete
solution to these equations presented in the next section is
a local computation.
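As a worked check, the first Euler equation can be obtained from the functional of Equation 12 in a few steps. The symbol $L$ for the integrand and the subscript notation $u_x$, $u_y$ for partial derivatives of $u$ are introduced here only for this sketch; $\hat{F}_2$ denotes the second frame displaced by the approximate disparity (Equation 8).

```latex
L = \bigl[F_x u + F_y v + (\hat{F}_2 - F_1)\bigr]^2
    + \alpha^2\bigl[\,|\nabla(U+u)|^2 + |\nabla(V+v)|^2\,\bigr]

% Euler equation for u:
\frac{\partial L}{\partial u}
  - \frac{\partial}{\partial x}\frac{\partial L}{\partial u_x}
  - \frac{\partial}{\partial y}\frac{\partial L}{\partial u_y} = 0

\frac{\partial L}{\partial u} = 2F_x\bigl[F_x u + F_y v + (\hat{F}_2 - F_1)\bigr],
\qquad
\frac{\partial L}{\partial u_x} = 2\alpha^2 (U+u)_x,
\qquad
\frac{\partial L}{\partial u_y} = 2\alpha^2 (U+u)_y

% Substituting and dividing by 2:
F_x\bigl[F_x u + F_y v + (\hat{F}_2 - F_1)\bigr] - \alpha^2 \Delta(U+u) = 0
\;\Longrightarrow\;
\alpha^2 \Delta(U+u) - F_x^2 u - F_x F_y v = F_x(\hat{F}_2 - F_1)
```

The equation for $v$ follows in the same way with the roles of $F_x$ and $F_y$, and of $U+u$ and $V+v$, exchanged.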


3.2  Discrete  Representation 
and  Computation 

Update vectors will be found by solving Equation 13. The
edge flow, given in Equation 11, will provide an initial estimate.
To do this we must formulate a discrete representation
of these equations. Finite difference approximations to
the “continuous” derivative operators will be used. Equation
13 then becomes a system of linear equations in the
variables $(u_{ij}, v_{ij}) = (u(i,j), v(i,j))$. These equations are
solved using an iterative relaxation scheme which is local,
uniform, and parallel, criteria we require to be satisfied
for efficient implementation in a cellular/processing-cone
architecture [Glazer 87].


3.2.1  Partial  Derivative  Operators 

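Equations 11 and 13 will be represented by discrete finite
difference methods. We do this with the following first difference
operators:

$$\frac{1}{h}\delta_x = \frac{1}{2h}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix} * \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \end{bmatrix}^T = \frac{1}{h}\cdot\frac{1}{8}\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$

$$\frac{1}{h}\delta_y = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \end{bmatrix} * \frac{1}{2h}\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}^T = \frac{1}{h}\cdot\frac{1}{8}\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$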

These operators were chosen to combine a first central
difference in the direction of the partial derivative and
smoothing (low-pass filtering) in the perpendicular direction
[Glazer 87, Chapter V, Section 3.2.1]. Note that, if we
ignore the 1/8 scaling factor, these operators are the Sobel
edge detectors used early on in machine vision.
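The separable structure of these operators can be sketched as an outer product of a three-point smoothing vector and a three-point central difference. The helper names (`outer`, `apply3x3`) and the ramp test image are illustrative assumptions; the sketch uses unit grid spacing, treats the first array index as the row, and applies the mask by correlation (no kernel flip).

```python
def outer(col, row):
    """Outer product of a column vector and a row vector as a nested list."""
    return [[c * r for r in row] for c in col]

SMOOTH = [1, 2, 1]     # low-pass weights (scaled by 1/4 in the text)
DIFF = [-1, 0, 1]      # central difference weights (scaled by 1/2 in the text)

DELTA_X = outer(SMOOTH, DIFF)   # difference along columns, smoothing along rows
DELTA_Y = outer(DIFF, SMOOTH)   # difference along rows, smoothing along columns

def apply3x3(kernel, img, i, j):
    """Apply a 3x3 mask centered at (i, j), including the 1/8 scale factor."""
    total = 0
    for a in range(3):
        for b in range(3):
            total += kernel[a][b] * img[i + a - 1][j + b - 1]
    return total / 8.0

# A linear ramp with slope 2 along columns and slope 3 along rows.
img = [[2 * j + 3 * i for j in range(5)] for i in range(5)]
print(DELTA_X)
print(apply3x3(DELTA_X, img, 2, 2), apply3x3(DELTA_Y, img, 2, 2))
```

On a linear ramp the two masks return the exact slopes, which is a quick way to confirm the 1/8 normalization.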


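Using these operators, Equation 11 is approximated by

$$(u_0, v_0) = \left(\frac{-\frac{1}{h}\delta_x F\,(\hat{F}_2 - F_1)}{\left(\frac{1}{h}\delta_x F\right)^2 + \left(\frac{1}{h}\delta_y F\right)^2},\ \frac{-\frac{1}{h}\delta_y F\,(\hat{F}_2 - F_1)}{\left(\frac{1}{h}\delta_x F\right)^2 + \left(\frac{1}{h}\delta_y F\right)^2}\right) = \left(\frac{-h\,\delta_x F\,(\hat{F}_2 - F_1)}{(\delta_x F)^2 + (\delta_y F)^2},\ \frac{-h\,\delta_y F\,(\hat{F}_2 - F_1)}{(\delta_x F)^2 + (\delta_y F)^2}\right) \tag{14}$$

Using the operators in Equation 13 (and multiplying
through by $h^2$) gives the approximations

$$\alpha^2\Delta_1(U+u) - (\delta_x F)^2 u - (\delta_x F)(\delta_y F)\,v = h\,\delta_x F\,(\hat{F}_2 - F_1) \tag{15a}$$

$$\alpha^2\Delta_1(V+v) - (\delta_x F)(\delta_y F)\,u - (\delta_y F)^2 v = h\,\delta_y F\,(\hat{F}_2 - F_1) \tag{15b}$$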


3.2.2  Normalized  Coordinate  Systems 

The base coordinate system is the coordinate system in
which the inter-pixel spacing (grid spacing) is 1 unit at the
base level, the level $L$ of the input images. Using this
coordinate system, at level $k$ the grid spacing is $h_k = 2^{L-k}$.
Equations 14 and 15 involve the grid spacing $h$. At the
lowest level, $h = 1$ and it can be removed from the equations.
The $h$ factor can also be removed from consideration
at all other levels of the pyramid if we use a normalized
coordinate system at each level. In the normalized
coordinate system at a given level, the distance between
two adjacent pixels is 1 unit. Image operators acting at a
single level can be computed without using $h_k$. However,
inter-level computations and comparisons must then take
into account the difference in the grid spacings as given by
the ratio $r = h_{k-1}/h_k$.

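Consider for example the edge flow equations (Equation 14).
If $(u_0,v_0)$ is the edge flow in base coordinates
and $(\tilde{u}_0,\tilde{v}_0)$ is the edge flow in normalized coordinates then
$h(\tilde{u}_0,\tilde{v}_0) = (u_0,v_0)$. Substitution into Equation 14 gives

$$(\tilde{u}_0, \tilde{v}_0) = \left(\frac{-\delta_x F\,(\hat{F}_2 - F_1)}{(\delta_x F)^2 + (\delta_y F)^2},\ \frac{-\delta_y F\,(\hat{F}_2 - F_1)}{(\delta_x F)^2 + (\delta_y F)^2}\right) \tag{16}$$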


Similarly the $h$ factor can be removed from Equation 15
when we specify that $(U,V)$ and $(u,v)$ are represented in
the normalized coordinate system. We will assume from
now on that this specification has been made and so we
need not use the tilde (~) notation.




3.2.3  Relaxation  Equations

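Equation 15 can be solved using the iterative point-Jacobi
method. This involves the iterative application of the following
update equations derived from Equation 15:

$$u^{k+1} := \bar{u}^k + (\bar{U} - U) - \frac{F_{\delta x}\left[F_{\delta x}(\bar{u}^k + \bar{U} - U) + F_{\delta y}(\bar{v}^k + \bar{V} - V) + (\hat{F}_2 - F_1)\right]}{\alpha^2 + F_{\delta x}^2 + F_{\delta y}^2} \tag{17a}$$

$$v^{k+1} := \bar{v}^k + (\bar{V} - V) - \frac{F_{\delta y}\left[F_{\delta x}(\bar{u}^k + \bar{U} - U) + F_{\delta y}(\bar{v}^k + \bar{V} - V) + (\hat{F}_2 - F_1)\right]}{\alpha^2 + F_{\delta x}^2 + F_{\delta y}^2} \tag{17b}$$

where $(\bar{U},\bar{V}) = (\bar{U}(i,j),\bar{V}(i,j))$ are local averages of
$(U,V)$; $(u^k,v^k)$ is the estimate at the $k$th iteration; $(\bar{u}^k,\bar{v}^k)$
are local averages of $(u^k,v^k)$; and $F_{\delta x} = \delta_x F$ and $F_{\delta y} = \delta_y F$
are the normalized first difference operators.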
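A single application of the update at one pixel can be sketched as follows. The function name `relax_step` and its argument names are illustrative assumptions; the local averages and the normalized first differences are taken as already computed.

```python
def relax_step(ubar, vbar, Ubar, Vbar, U, V, fx, fy, diff, alpha):
    """One point-Jacobi update (Equation 17) at a single pixel.

    ubar, vbar : local averages of the current update vector field
    Ubar, Vbar : local averages of the approximate disparity field
    U, V       : approximate disparity at this pixel
    fx, fy     : normalized first differences of the mean image
    diff       : displaced second frame minus first frame at this pixel
    alpha      : smoothness weighting factor
    """
    # Averaged disparity expressed relative to the approximate disparity.
    a = ubar + Ubar - U
    b = vbar + Vbar - V
    # Signed, gradient-weighted residual of the constraint line equation.
    s = (fx * a + fy * b + diff) / (alpha ** 2 + fx ** 2 + fy ** 2)
    return a - fx * s, b - fy * s

# With alpha = 0 the result lands exactly on the constraint line.
u_new, v_new = relax_step(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 4.0, -5.0, 0.0)
print(u_new, v_new)
```

With a nonzero `alpha` the same call returns a point between the averaged value and the constraint line; both the gradient and `alpha` being zero is not handled here and would divide by zero.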


3.3  Geometric  Interpretation  of  Update 
Equations 


The relaxation update equations (17) are better understood
in terms of their geometric interpretation. They can be
thought of as entailing two steps for each pixel: (1) compute
the average disparity in the local neighborhood; (2) project
this value towards the constraint line. The second step is
accomplished by picking a disparity vector that lies on the
line segment joining the average disparity and its projection
onto the constraint line. The choice is made closer to
the constraint line when we have higher confidence in the
measurement of this line, namely when the spatial gradient
at this pixel is high. The following paragraphs elaborate on
these ideas.

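First consider the case for which the approximate disparity
$(U,V) = 0$. Equation 17 then simplifies to

$$u^{k+1} = \bar{u}^k - \frac{F_{\delta x}\left[F_{\delta x}\bar{u}^k + F_{\delta y}\bar{v}^k + (F_2 - F_1)\right]}{\alpha^2 + F_{\delta x}^2 + F_{\delta y}^2} \tag{18a}$$

$$v^{k+1} = \bar{v}^k - \frac{F_{\delta y}\left[F_{\delta x}\bar{u}^k + F_{\delta y}\bar{v}^k + (F_2 - F_1)\right]}{\alpha^2 + F_{\delta x}^2 + F_{\delta y}^2} \tag{18b}$$

or rearranged into a vector notation

$$(u,v)^{k+1} = (\bar{u},\bar{v})^k - \frac{(\bar{u},\bar{v})^k\cdot(F_{\delta x},F_{\delta y}) + (F_2 - F_1)}{\alpha^2 + F_{\delta x}^2 + F_{\delta y}^2}\,(F_{\delta x},F_{\delta y}) \tag{19}$$

Note that since $(U,V) = 0$, $\hat{F}_2 = F_2$.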

The geometry of this simple relaxation updating equation
is shown in Figure 2. The average disparity over the
local neighborhood of an image point is shown as the point
$(\bar{u},\bar{v})^k$ in $(u,v)$ disparity space. The constraint line is shown
with its equation $(u,v)\cdot(F_x,F_y) + (F_2 - F_1) = 0$. The
(perpendicular) projection of $(\bar{u},\bar{v})^k$ onto the constraint line is
shown as $(\bar{u},\bar{v})^k - (p,q)$ where $(p,q)$ is the vector from that
point to $(\bar{u},\bar{v})^k$. It is easy to verify that:

$$(p,q) = \frac{(\bar{u},\bar{v})^k\cdot(F_x,F_y) + (F_2 - F_1)}{F_x^2 + F_y^2}\,(F_x,F_y) \tag{20}$$

Figure 2: Geometry of simple relaxation updating

$(\bar{u},\bar{v})^k$ is the average of neighboring pixel flow estimates
at iteration $k$. $(\bar{u},\bar{v})^k - (p,q)$ is the perpendicular projection
of $(\bar{u},\bar{v})^k$ onto the constraint line. This is the point
on the constraint line nearest to $(\bar{u},\bar{v})^k$. The new flow
estimate $(u,v)^{k+1}$ lies between $(\bar{u},\bar{v})^k$ and $(\bar{u},\bar{v})^k - (p,q)$.


Now suppose that we were to choose an update vector
$(u,v)^{k+1}$ along the line segment joining $(\bar{u},\bar{v})^k$ and
$(\bar{u},\bar{v})^k - (p,q)$ using a linear combination of these two
vectors, that is

$$(u,v)^{k+1} = (1-t)\,(\bar{u},\bar{v})^k + t\left[(\bar{u},\bar{v})^k - (p,q)\right] = (\bar{u},\bar{v})^k - t\,(p,q) \tag{21}$$


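where $t \in [0,1]$ selects a position between the two end
points. When Equation 20 is substituted into this equation,
we can make the result equivalent to Equation 19 if
we set

$$t = \frac{F_x^2 + F_y^2}{\alpha^2 + F_x^2 + F_y^2} = \frac{\|(F_x,F_y)\|^2}{\alpha^2 + \|(F_x,F_y)\|^2} = \frac{1}{1 + \left(\dfrac{\alpha}{\|(F_x,F_y)\|}\right)^2} \tag{22}$$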


In Figure 3, $t$ is graphed as a function of the magnitude of
the spatial gradient $\|(F_x,F_y)\|$. We see that when the gradient
is small $t$ goes to zero and, looking at Equation 21,
$(u,v)^{k+1}$ is chosen near $(\bar{u},\bar{v})^k$. On the other hand as the
gradient gets very large $t$ goes to 1.0 and $(u,v)^{k+1}$ is chosen
near $(\bar{u},\bar{v})^k - (p,q)$. The midway point, giving equal
weight to both possibilities, occurs when the magnitude of
the gradient is equal to $\alpha$. The parameter $\alpha$ controls the
shape of this function. Larger values of $\alpha$ “stretch” the
curve out to the right so that higher values of the spatial
gradient are needed to pull the update vector towards the
constraint line.

Figure 3: Update weighting factor
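The equivalence between the interpolation form and the direct update can be checked numerically. In this sketch the function names and the sample values are illustrative assumptions; the gradient is assumed nonzero so that the projection is defined.

```python
def weight(alpha, fx, fy):
    """Weighting factor t: squared gradient over (alpha^2 + squared gradient)."""
    g2 = fx * fx + fy * fy
    return g2 / (alpha * alpha + g2)

def update_direct(ubar, vbar, fx, fy, diff, alpha):
    """Update written as in the vector relaxation equation (zero approximate disparity)."""
    s = (ubar * fx + vbar * fy + diff) / (alpha * alpha + fx * fx + fy * fy)
    return ubar - s * fx, vbar - s * fy

def update_interpolated(ubar, vbar, fx, fy, diff, alpha):
    """Same update as a move of fraction t from the average towards its projection."""
    g2 = fx * fx + fy * fy
    s = (ubar * fx + vbar * fy + diff) / g2
    p, q = s * fx, s * fy          # vector from the projection to the average
    t = weight(alpha, fx, fy)
    return ubar - t * p, vbar - t * q

a = update_direct(1.0, -2.0, 3.0, 4.0, 0.5, 2.0)
b = update_interpolated(1.0, -2.0, 3.0, 4.0, 0.5, 2.0)
print(a, b, weight(5.0, 3.0, 4.0))
```

The last printed value illustrates the midway point: with a gradient magnitude of 5 and `alpha` equal to 5, the weight is one half.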

Now we generalize this picture to the case of hierarchical
relaxation updating. Figure 4 shows the geometry
of updating for a given location in the image. First note
that two coordinate systems are shown in this diagram.
The main coordinate system, labeled “I”, is the $(u,v)$ space
with its origin at the point corresponding to zero disparity.
This corresponds to the coordinate space shown in Figure 2.
The approximate disparity $(U,V)$ is shown with its tail at
the origin of this space. The secondary coordinate system,
labeled “II”, is the $(u,v)$ space with its origin at the
approximate disparity. Subscripts of I or II are used to specify
which space is being used for a given set of coordinates for
a point. For example, the approximate disparity can be
represented as $(U,V)_{I}$ or $(0,0)_{II}$ and the updated disparity
as $(U+u^{k+1}, V+v^{k+1})_{I}$ or $(u^{k+1},v^{k+1})_{II}$. Because the
coordinate spaces differ only by a translation, the coordinate
representation of a vector (as opposed to a point) is
independent of the choice of coordinate space.

The selection of an update vector again involves (1)
computation of the average disparity in the local neighborhood,
and (2) projection of this value towards the constraint
line. The average disparity is the average of the
approximate disparity vector summed with the update
vector at each pixel in the local neighborhood. This is
$(\overline{u^k + U}, \overline{v^k + V})_{I} = [(\bar{u},\bar{v})^k + (\bar{U},\bar{V})]_{I}$ or in the secondary
coordinate system $[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)]_{II}$. The constraint
equation is defined on update vectors $(u,v)$ and
hence it is defined in the secondary space. This is signified
in the figure by the subscript II given to the equation.
Again $(p,q)$ is the vector pointing to the average disparity
from its (perpendicular) projection on the constraint line.
In this case it is given by

$$(p,q) = \frac{\left[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)\right]\cdot(F_x,F_y) + (\hat{F}_2 - F_1)}{F_x^2 + F_y^2}\,(F_x,F_y) \tag{23}$$

Figure 4: Geometry of hierarchical relaxation updating


The generalization of the earlier non-hierarchical case is
now straightforward. We choose an update vector $(u,v)^{k+1}$
along the line segment joining $[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)]_{II}$
and $[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)]_{II} - (p,q)$ using a linear
combination of these two vectors, that is

$$(u,v)^{k+1} = (1-t)\left[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)\right] + t\left[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V) - (p,q)\right] = \left[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)\right] - t\,(p,q) \tag{24}$$


where $t \in [0,1]$ selects a position between the two end
points. When Equation 23 is substituted into this equation,
we can make the result equivalent to Equation 17 if, as before,
we use $t$ as defined in Equation 22. In Figure 3, $t$ is
graphed as a function of the magnitude of the spatial gradient
$\|(F_x,F_y)\|$ with the parameter $\alpha$ controlling the shape of
this function. When the gradient is small $t$ goes to zero and
$(u,v)^{k+1}$ is chosen near $[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)]_{II}$. On the
other hand as the gradient gets very large $t$ goes to 1.0 and
$(u,v)^{k+1}$ is chosen near $[(\bar{u},\bar{v})^k + (\bar{U},\bar{V}) - (U,V)]_{II} - (p,q)$.
The midway point, giving equal weight to both possibilities,
occurs when the magnitude of the gradient is equal to
$\alpha$. Larger values of $\alpha$ “stretch” the curve out to the right
so that higher values of the spatial gradient are needed to
pull the update vector towards the constraint line.




4  Hierarchical  Disparity 
Algorithms 

4.1  Hierarchical  Data  Flow 

Data flow graphs for the hierarchical disparity algorithm
are shown in Figures 5-7. Figure 5 shows the coarse-to-fine
flow of control and data. The hierarchical disparity computation
is a coarse-to-fine algorithm operating on two low-pass
image pyramids. The PROJECTION operator passes
the updated disparity field to the next level of the processing
cone. The REFINEMENT process computes an updated
disparity field given an estimated disparity field and the image
pyramid data at the corresponding resolution. A data
flow diagram of REFINEMENT is shown in Figure 6. Update
vectors are computed using an iterative relaxation process
shown in Figure 7. A more detailed description of the components
of the hierarchical disparity algorithm is given in
[Glazer 87, Chapter V, Section 4].

4.2  Projection 

The Projection operator passes the coarse disparity vector
of the father to its four sons with two changes. First, the
components of the vector are multiplied by 2, the inter-grid
spacing ratio. This converts between the two normalized
coordinate systems. Second, the values are then truncated
to the nearest integer. This simplifies the computation of
gradients to follow.
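A minimal sketch of the projection step on a disparity field stored as a nested list of `(U, V)` tuples. The function names are illustrative, and reading “truncated to the nearest integer” as rounding half up is an assumption of this sketch.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, halves upward (assumed reading of the text)."""
    return math.floor(x + 0.5)

def project(coarse):
    """Pass each father's disparity to its four sons at the next finer level.

    coarse : list of rows, each a list of (U, V) tuples in the father's
             normalized coordinates
    Returns a field with twice as many rows and columns.
    """
    fine = []
    for row in coarse:
        fine_row = []
        for (U, V) in row:
            # Multiply by the inter-grid spacing ratio, then snap to the grid.
            son = (round_half_up(2 * U), round_half_up(2 * V))
            fine_row.extend([son, son])
        fine.append(fine_row)
        fine.append(list(fine_row))
    return fine

print(project([[(0.6, -1.3)]]))
```

Each father here maps to a 2 x 2 block of identical sons, which matches the description of passing one vector to four sons.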


4.3  Gradients 


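The gradients that must be computed are $\delta_x F$, $\delta_y F$, and
$\hat{F}_2 - F_1$ (see Equations 16-17), where $\hat{F}_2(\mathbf{x}) = F_2(\mathbf{x}+\mathbf{U})$ and
$F = \tfrac{1}{2}(F_1 + \hat{F}_2)$. Since $\delta_x$ and $\delta_y$ are linear operators,
$\delta_x F = \tfrac{1}{2}(\delta_x F_1 + \delta_x \hat{F}_2)$ and similarly for $\delta_y F$. When
refinement is being performed at a pixel $(i,j)$, $\delta_x F_1$ is a first
difference of $F_1$ computed at $(i,j)$ and $\delta_x \hat{F}_2$ is a first
difference of $F_2$ computed at $(i+U, j+V)$.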

Upon projection, the approximate disparity $(U,V)$ is
truncated to the nearest integer. This allows for the use of
a single set of first difference operators. Any reduction in
accuracy is corrected by the subsequent updating process.
Other techniques can be used including (1) no truncation,
with interpolated first difference operators, or (2) truncation
to some subpixel precision, with a finite set of first difference
operators. The tradeoffs involved in the selection
of such a technique are discussed in [Glazer 87, Chapter 5,
Section 4.4.3].

The first difference operators $\delta_x$ and $\delta_y$ specified in
Section 3.2.1 include some smoothing. Smoothing is also
introduced into the computation of $\hat{F}_2 - F_1$ using the filter

$$\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}.$$

Thus, the three gradients are computed using three pairs
of 3 x 3 neighborhood operators, one of the pair about a
point $f_1(i,j)$ and the other about $f_2(i+U, j+V)$.

Figure 5: Hierarchical disparity: data flow

The overall hierarchical disparity computation is a
coarse-to-fine algorithm operating on two low-pass image
pyramids (shown in dotted outline). The REFINEMENT
processes compute an updated disparity field given an
estimated disparity field and the image pyramid data at
the corresponding resolution. The PROJECTION operator
passes the updated disparity field to the next level
of the processing cone.

These  three  gradients  cannot  be  computed  if  either  of 
the  3x3  neighborhoods  does  not  lie  completely  within  its 
respective  image.  This  is  the  first  error  condition  that  may 
occur,  called  search  area  overflow.  At  pixels  where  this 
occurs,  the  gradients  are  set  to  (0,0,0). 

4.4  Edge  Flow 

The edge flow is computed as specified by Equation 16.
Three error conditions may be encountered at any given
pixel. First, if search area overflow has occurred then there
are no gradient values available. Second, the spatial gradient
may either be zero or too small to give an accurate
answer when used as a divisor. The third error condition
occurs when the edge flow vector is too large. The derivation
of this constraint is presented in the next paragraph.
For all three error conditions we set $(u_0,v_0) = (0,0)$.


Figure 6: Hierarchical disparity: refinement

The REFINEMENT process performed at each level inputs
an estimated disparity $(U,V)$ and image pyramid data
$F_1$ and $F_2$. Single-level (non-hierarchical) image operators
compute (1) the spatial gradients and image differences
$\nabla F$, (2) the edge flow at each pixel $(u_0,v_0)$, and
(3) an update vector $(u,v)$. The update vector is added
to the estimated disparity to give the updated (refined)
disparity. An array of error flags (err), one per pixel, is
carried along through the operations at one level. The
operators shown may generate error flags or respond to
previously generated ones.


The third error condition is a consistency check on the
computed edge flow. In Section 3.1.2 we noted that
(1) the edge flow can be no larger (in magnitude) than the
update vector, and (2) the availability of an approximate
disparity implies a bound on the size of the update vector.
Together these place an upper bound on the size of
the edge flow vector. We can use either of the following
relationships:

$$\|(u_0,v_0)\|_2 \le \|(u,v)\|_2 \tag{25}$$

$$\|(u_0,v_0)\|_2 \le \sqrt{2}\,\|(u,v)\|_{Max} \tag{26}$$

where $\|(x,y)\|_2 = (x^2 + y^2)^{1/2}$ is the Euclidean norm and
$\|(x,y)\|_{Max} = \max(|x|,|y|)$ is the Maximum norm. The
second inequality follows from the first and from the fact
that $\|(u,v)\|_2 \le \sqrt{2}\,\|(u,v)\|_{Max}$ (since
$0 \le u^2 + v^2 \le 2[\max(|u|,|v|)]^2$). The second inequality is used when we
have bounds on the individual components of the update
vector. For example, the truncation of the approximate
disparity vectors to the nearest pixel upon projection
introduces an error of ±1 pixel in each of its components.
Were this the only error in that disparity estimate, then
the bound on the size of the update vector would be
$\|(u,v)\|_{Max} \le 1$ and using Equation 26 we get the constraint
$\|(u_0,v_0)\|_2 \le \sqrt{2}$. Thus we have established a maximum
allowable value $EF_{max}$ of the magnitude of the edge
flow $\|(u_0,v_0)\|_2$.

Figure 7: Hierarchical disparity: relaxation

The update vector is computed by iterative application
of the relaxation operator specified in Equation 17 (when
the error flag is set to NO ERROR). The edge flow $(u_0,v_0)$
is used as the first estimate of the update vector input
to RELAXATION. Thereafter, the current values of $(u,v)$
are used. The LAPLACIAN of $(U,V)$ is only computed
once.
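The three error conditions can be sketched as a guarded version of the edge flow computation. The flag strings for search area overflow, oversized edge flow, and the no-error case follow the names used in the text; the name `GRADIENT_TOO_SMALL`, the function name, and the default thresholds `grad_min` and `ef_max` are assumptions of this sketch.

```python
import math

NO_ERROR = "NO ERROR"
SA_OVERFLOW = "SA OVERFLOW"
GRADIENT_TOO_SMALL = "GRADIENT TOO SMALL"   # hypothetical flag name
EDGE_FLOW_TOO_BIG = "EDGE FLOW TOO BIG"

def checked_edge_flow(fx, fy, diff, overflow=False,
                      grad_min=1e-3, ef_max=math.sqrt(2.0)):
    """Return ((u0, v0), flag); the edge flow is (0, 0) whenever a flag is raised."""
    if overflow:
        # A 3x3 neighborhood fell outside one of the images.
        return (0.0, 0.0), SA_OVERFLOW
    g2 = fx * fx + fy * fy
    if g2 < grad_min * grad_min:
        # Gradient too small to be used safely as a divisor.
        return (0.0, 0.0), GRADIENT_TOO_SMALL
    u0, v0 = -fx * diff / g2, -fy * diff / g2
    if math.hypot(u0, v0) > ef_max:
        # Consistency check: edge flow exceeds the bound on the update vector.
        return (0.0, 0.0), EDGE_FLOW_TOO_BIG
    return (u0, v0), NO_ERROR

print(checked_edge_flow(3.0, 4.0, -5.0))
print(checked_edge_flow(3.0, 4.0, -50.0))
```

The default `ef_max` corresponds to the case in which truncation upon projection is the only error in the approximate disparity; a looser bound would be passed in when other errors are expected.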


4.5  Relaxation 

Update vectors are computed by an iterative relaxation process
specified by Equation 17. If any of the possible error
conditions exist at a pixel, then no adequate constraint line
exists at that pixel. Thus the update equations cannot be
used as is. However, as we noted in Section 3.3, the update
equation can be interpreted as specifying that we (1) compute
an update vector based on the average of neighboring
information, followed by (2) projection of this vector towards
the constraint line. Under this interpretation step 1 can still
be performed. Thus we arrive at an update algorithm in
which at each pixel one of two things takes place: either
(1) update by Equation 17 at non-error points, or (2) update
by simple smoothing at points with errors. The latter
is a form of interpolation into points (or areas) with no
motion-based constraints. If there are large blocks where
updating by (1) cannot take place, then vectors in those
regions are determined from the constrained vectors which
surround the respective regions.
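The two-way update can be sketched as one Jacobi sweep over small nested-list fields, for the simplified case of zero approximate disparity. The averaging weights (1/6 for edge neighbors, 1/12 for corner neighbors, in the style of Horn and Schunck) are an assumption of this sketch, chosen so that three times (average minus center) reproduces the 3 x 3 Laplacian mask with center weight -3; the border clamping and all names are likewise illustrative.

```python
# Assumed averaging weights; they sum to 1 and exclude the center pixel.
WEIGHTS = ((1 / 12, 1 / 6, 1 / 12),
           (1 / 6, 0.0, 1 / 6),
           (1 / 12, 1 / 6, 1 / 12))

def local_avg(field, i, j):
    """Weighted neighborhood average, clamping indices at the image border."""
    rows, cols = len(field), len(field[0])
    total = 0.0
    for a in range(3):
        for b in range(3):
            ii = min(max(i + a - 1, 0), rows - 1)
            jj = min(max(j + b - 1, 0), cols - 1)
            total += WEIGHTS[a][b] * field[ii][jj]
    return total

def sweep(u, v, fx, fy, diff, err, alpha):
    """One Jacobi sweep: full update at non-error pixels, smoothing elsewhere."""
    rows, cols = len(u), len(u[0])
    new_u = [[0.0] * cols for _ in range(rows)]
    new_v = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            ub, vb = local_avg(u, i, j), local_avg(v, i, j)
            if not err[i][j]:
                # Project the averaged vector towards the constraint line.
                gx, gy = fx[i][j], fy[i][j]
                s = (gx * ub + gy * vb + diff[i][j]) / (alpha ** 2 + gx ** 2 + gy ** 2)
                ub, vb = ub - gx * s, vb - gy * s
            new_u[i][j], new_v[i][j] = ub, vb
    return new_u, new_v
```

Pixels flagged in `err` simply inherit the neighborhood average, so repeated sweeps propagate constrained values into unconstrained regions. The sketch assumes that zero-gradient pixels are always flagged, since otherwise `alpha = 0` would divide by zero.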



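The average update vectors $(\bar{u},\bar{v})$ are computed using
the following local averaging filter.

The discrete Laplacian used to compute $(\Delta U, \Delta V)$ is

$$\mathrm{Lap}_1 = \begin{bmatrix} 1/4 & 1/2 & 1/4 \\ 1/2 & -3 & 1/2 \\ 1/4 & 1/2 & 1/4 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ 2 & -12 & 2 \\ 1 & 2 & 1 \end{bmatrix}.$$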

5  Experiments 

5.1  The  Failure  of  Single  Level  Methods 

As we noted in the introduction to this chapter, gradient-based
disparity computations will fail when the disparities
are large relative to the range over which the image “function”
is approximately linear. In general, this is a relatively
short distance. Such failure takes the form of incorrect disparity
estimates based on a model of the local image neighborhood
that provides a bad approximation.

The following series of experiments shows the inability
of single level gradient-based methods to compute disparity
fields. Four experiments are performed with disparities
which start at a low magnitude and increase by a factor
of two for each subsequent experiment. We expect the disparity
computations with single level gradient-based methods
to become less accurate as the disparities increase in size.
This is in fact what the experiments show.

Two 128 x 128 pieces of the MANDRILL image are used
with a disparity of (-5, 7) from frame 1 to frame 2, i.e.
5 pixels up and 7 to the right. Low-pass pyramids were
built from these images and four separate runs of single
level disparity computation were performed at levels 4 ($16^2$)
through 7 ($128^2$). Thus the disparities at each level were
(-.625, .875), (-1.25, 1.75), (-2.5, 3.5), and (-5, 7) respectively.
The EDGE FLOW TOO BIG error condition was included
with maximum allowable edge flow magnitudes of
1.5, 3, 6, and 12 respectively. When this error condition is
not included, results are worse.
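The per-level disparities quoted above follow directly from the factor-of-two grid spacing between adjacent pyramid levels. A small sketch (the function name and argument order are illustrative):

```python
def disparity_at_level(base_disparity, level, base_level):
    """Express a base-level disparity in the normalized coordinates of `level`."""
    scale = 2 ** (base_level - level)   # grid spacing at `level` in base pixels
    return tuple(c / scale for c in base_disparity)

for level in range(4, 8):
    print(level, disparity_at_level((-5, 7), level, 7))
```

Each step down the pyramid (towards the base level) doubles the disparity measured in that level's pixels, which is why the single level runs become progressively harder.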

The results are shown in Figures 8 and 9. In Figure 8,
for all four levels, we show: (left) the first frame of image
data; (center) disparity vectors after 50 iterations; and
(right) error flags. The light gray pixels on the border of the
error flags display indicate the SA OVERFLOW error condition.
The predominant dark pixels throughout the interior
of the error flags display indicate the EDGE FLOW TOO BIG
condition. In Figure 9 the statistics of the computed disparities
are compared to the expected values.


It  is  clear  that  as  the  disparity  is  increased  relative  to 
the  pixel  spacing,  the  performance  of  single  level  gradient- 
based  methods  deteriorates  quickly.  At  level  4,  with  ex¬ 
pected  disparity  equal  to  (-.625,  .875),  results  are  good. 
At  level  5,  with  expected  disparity  equal  to  (-1.25,1.75), 
accuracy  is  significantly  diminished.  At  level  6,  with  ex¬ 
pected  disparity  equal  to  (-2.5, 3.5),  computed  disparities 
are  largely  incorrect.  At  level  7,  with  expected  disparity 
equal  to  (-5,7),  computed  disparities  cluster  near  (0,0) 
and  show  little  relation  to  the  expected  disparity. 

5.2  Hierarchical  Method 

The hierarchical gradient-based disparity algorithm is
demonstrated in the following experiment. Again 128 × 128
pieces of the MANDRILL image are used with a disparity
of (-5, 7). A low-pass pyramid was built for each input
image. Processing was begun at level 4.

In Figure 10 disparity vectors at different stages of the
computation are shown. The edge flow at level 4 is shown
in Figure 10a. It is labeled as iteration 0 since it is used
as the initial approximation. The update vectors after 10
iterations are shown in Figure 10b. Up to this point, this
experiment is equivalent to the first single level experiment
of the last section. If relaxation continues at this level to
the 50th iteration, the update vectors are then the same as
those shown at level 4 in Figure 8.

At level 5, Figure 10c shows both the approximate
disparity passed down from level 4 and the initial (iteration
0) update vector (the edge flow). Figure 10d shows the
initial disparity and the final (iteration 10) update vector.
At each pixel shown (every 2nd row and every 2nd column)
the approximate disparity is attached by its tail to the pixel
in frame 1 that it corresponds to. The tails of the update
vectors are attached to the heads of the approximate
disparity vectors, and the update vectors have an arrowhead
at their head. The arrowheads are proportional in size to the
update vector and do not show up when that vector is small.
The sum of the approximate and the update vectors is the
updated disparity estimate, shown in Figure 10e.
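The bookkeeping between levels described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the coarse field is passed down by replicating each vector over a 2 × 2 block and doubling it (since the pixel spacing halves), and the update field is a zero placeholder standing in for the relaxation result. The function names are hypothetical.

```python
import numpy as np

def project_to_finer_level(disparity):
    """Pass a (rows, cols, 2) disparity field down one pyramid level.

    Each coarse pixel is replicated over a 2 x 2 block of finer pixels
    (an assumption of this sketch); vectors double in length because
    the pixel spacing halves."""
    finer = np.repeat(np.repeat(disparity, 2, axis=0), 2, axis=1)
    return 2.0 * finer

def updated_disparity(approximate, update):
    """Updated estimate = approximate disparity + update vector."""
    return approximate + update

level4 = np.tile(np.array([-0.625, 0.875]), (16, 16, 1))  # expected level-4 field
approx5 = project_to_finer_level(level4)                  # initial level-5 field
update5 = np.zeros_like(approx5)                          # placeholder update vectors
level5 = updated_disparity(approx5, update5)
```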

Figures 10f-h show the comparable vectors at level 6.
The final results at level 7 are shown in Figure 10i.

In Figure 11 the error flags are shown. Light gray pixels
around the borders are SA-OVERFLOW errors. Dark pixels
are mostly EDGE-FLOW-TOO-BIG errors. There were 2, 8,
188, and 1021 EDGE-FLOW-TOO-BIG errors at levels 4, 5,
6 and 7 respectively and 1 ZERO-GRAD error at level 7.

In Figure 12, histograms of the disparity fields at each of
the levels are shown. These histograms are built by counting
all disparity vectors that lie within ±½ of each pixel. For
example, the number at location (0,0) in each histogram
is the number of disparity vectors with row and column
components both in the range [-½, ½).
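The binning rule just described can be sketched in a few lines. This is a minimal illustration, assuming each component is assigned to the integer n whose half-open interval [n-½, n+½) contains it; the function name and the sample vectors are hypothetical.

```python
from collections import Counter
import math

def disparity_histogram(vectors):
    """Count (row, col) disparity vectors per integer bin [n - 1/2, n + 1/2)."""
    counts = Counter()
    for drow, dcol in vectors:
        counts[(math.floor(drow + 0.5), math.floor(dcol + 0.5))] += 1
    return counts

# Made-up sample vectors, chosen to exercise the bin boundaries.
hist = disparity_histogram([(-0.625, 0.875), (-0.4, 0.3), (-0.5, 0.49), (0.5, 1.2)])
```

Note that -0.5 falls in the bin centered on 0, while +0.5 falls in the bin centered on 1, in keeping with the half-open interval.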




                  Level 4               Level 5
               D_row    D_col        D_row    D_col
Actual
  mean         -.579     .820       -1.112    1.464
  SD            .066     .095         .284     .407
  minimum      -.749     .585       -2.321     .471
  maximum      -.364    1.037       -0.317    3.184
Expected       -.625     .875       -1.25     1.75

                  Level 6               Level 7
               D_row    D_col        D_row    D_col
Actual
  mean        -0.924     1.01        -0.47     0.50
  SD           1.066     1.26         1.74     1.69
  minimum     -5.362    -3.92       -11.03    -7.42
  maximum      4.538     5.19        -7.28    10.55
Expected       -2.5      3.5         -5.       7.

Figure  9:  Single-level  analysis:  Disparity  Statistics 


The hierarchical gradient-based algorithm uses a
coarse-to-fine control strategy to progressively refine
individual motion estimates. This strategy can also
be applied to correlation-based motion detection
algorithms [Wong & Hall 78, Moravec 81, Glazer et al. 83,
Anandan 84, Quam 84]. Two specific hierarchical
algorithms, one correlation-based [Glazer 87, Chapter 4]
and the gradient-based algorithm presented here, have
been shown by experiment to have comparable accuracy
[Glazer 87]. Furthermore, comparison of the computational
costs, both arithmetic and data transfer, shows that
the gradient-based algorithms are, in general, less costly
[Glazer 87, Chapter 7]. The computational costs of
hierarchical correlation are relatively high for two reasons:
(1) many multiplications and additions are needed for the
correlation inner product; and (2) many pixel values must
be transferred over multi-pixel distances in the image plane.


In Figure 12, disparity vectors are included in the
histograms only if no error has been detected at their
respective pixel locations. The expected disparities are
(-.625, .875), (-1.25, 1.75), (-2.5, 3.5), and (-5, 7) at levels
4 through 7 respectively. The center of gravity of
these histograms is one measure of accuracy. For the
histograms shown, the centers of gravity are (-.8918, .5309),
(-1.782, 1.384), (-2.919, 3.401), and (-4.929, 6.659) at levels
4 through 7 respectively. In all cases, both components
are within pixel spacing of the expected disparity values.
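The center of gravity of a histogram is the count-weighted mean of its bin locations. As a check on the arithmetic, the sketch below applies this to the level-4 histogram of Figure 12 (counts 81, 92 in row -1 and 10, 11 in row 0, for columns 0 and 1); the function name is hypothetical.

```python
import numpy as np

def center_of_gravity(counts, row_values, col_values):
    """Count-weighted mean (row, col) location of a 2-D histogram."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_cog = (counts.sum(axis=1) * np.asarray(row_values)).sum() / total
    col_cog = (counts.sum(axis=0) * np.asarray(col_values)).sum() / total
    return row_cog, col_cog

# Level-4 histogram of Figure 12: rows are bins -1 and 0, columns are bins 0 and 1.
level4 = [[81, 92],
          [10, 11]]
row_cog, col_cog = center_of_gravity(level4, row_values=[-1, 0], col_values=[0, 1])
```

With 194 vectors in total this gives -173/194 for the row component and 103/194 for the column component, which should agree with the level-4 center of gravity quoted in the text to four decimal places.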

In  Figure  13,  the  statistics  of  the  row  and  column  com¬ 
ponents  of  the  disparity  vectors  are  shown.  Again  only 
vectors  at  non-error  pixels  are  included.  The  mean  val¬ 
ues  are  not  equal  to  the  above  listed  centers  of  gravity  of 
the  histograms,  because  the  histograms  represent  disparity 
vectors  with  components  truncated  to  the  nearest  integer. 
These  can  be  compared  to  the  corresponding  statistics  for 
the  single  level  shown  in  Figure  9.  The  hierarchical  method 
clearly  succeeds  where  the  single  level  method  fails. 


6  Summary  and  Discussion 

In  this  report  we  have  demonstrated  that  first  order 
gradient-based  methods  may  be  extended  to  the  computa¬ 
tion  of  large  disparities  by  formulating  a  hierarchical  gen¬ 
eralization  of  the  single  level  method  that  operates  on  low- 
pass  image  pyramids.  A  thorough  formal  analysis  of  the 
gradient-based  updating  equations  was  presented  as  well 
as  the  implementation  of  the  method  in  a  hierarchical  pro¬ 
cessing  cone  algorithm.  Experiments  were  presented  both 
to  show  how  single  level  methods  fail  and  how  the  hierar¬ 
chical  method  succeeds. 


References 

[Anandan 84] Anandan, P., Computing Dense Displacement
Fields with Confidence Measures in Scenes Containing
Occlusion, SPIE Vol. 521 Intelligent Robots and
Computer Vision, Cambridge, Mass., 1984.

[Burt 81] Burt, P.J., Fast Filter Transforms for Image
Processing, CGIP 16:20-51, 1981.

[Burt 82] Burt, P.J., Pyramid-Based Extraction of Local
Image Features with Applications to Motion and Texture
Analysis, Proc. SPIE Conf. on Robotics and Industrial
Inspection, San Diego, 1982.

[Cornelius & Kanade 83] Cornelius, N. and Kanade, T.,
Adapting Optical-Flow to Measure Object Motion in
Reflecting and X-ray Image Sequences, Proc. ACM
SIGGRAPH/SIGGART Interdisciplinary Workshop on
Motion: Representation and Perception, pp. 50-58,
1983.

[Courant & Hilbert 53] Courant, R. and Hilbert, D., Methods
of Mathematical Physics, Volume I, Interscience
Publishers, Inc., New York, 1953.

[Fennema & Thompson 79] Fennema, C.L. and Thompson,
W.B., Velocity Determination in Scenes Containing
Several Moving Objects, CGIP 9(4):301-315, 1979.

[Glazer 81] Glazer, F., Computing Optic Flow, Proc. 7th
IJCAI, pp. 644-647, Vancouver, BC, 1981.

[Glazer et al. 83] Glazer, F., Reynolds, G., and Anandan, P.,
Scene Matching by Hierarchical Correlation, Proc.
CVPR, pp. 432-441, 1983.




Figure  10:  Multilevel  experiment:  disparity  vectors 

(a)  level  4,  iteration  0  (b)  level  4,  iteration  10  (c)  level  5,  iteration  0 

(d)  level  5,  iteration  10  (e)  level  5,  updated  vectors  (f)  level  6,  iteration  0 

(g)  level  6,  iteration  10  (h)  level  6,  updated  vectors  (i)  level  7,  updated  vectors 




Figure  11:  Multilevel  experiment:  error  flags 


Error flags at levels 4 through 7 for the multilevel experiment. Light gray
pixels around the borders are SEARCH-AREA-OVERFLOW errors. Dark pixels
are predominantly EDGE-FLOW-TOO-BIG errors.


[Glazer 87] Glazer, F., Hierarchical Motion Detection,
Ph.D. Thesis and COINS Tech. Report 87-02,
U. Massachusetts, Amherst, 1987.

[Hanson & Riseman 74] Hanson, A. and Riseman, E.M.,
Preprocessing Cones: A Computational Structure
for Scene Analysis, COINS Tech. Report 74C-7,
U. Massachusetts, Amherst, 1974.

[Hanson & Riseman 80] Hanson, A. and Riseman, E.M.,
Processing Cones: A Computational Structure for Image
Analysis, In: Structured Computer Vision, Tanimoto,
S. and Klinger, A. (Eds.), Academic Press, New
York, 1980.

[Horn & Schunck 81] Horn, B.K.P. and Schunck, B.G.,
Determining Optical Flow, Artificial Intelligence
17(1-3):185-204, 1981. Also in Proc. IUW (DARPA), April
1981.

[Kahn 85] Kahn, P., Local Determination of a Moving
Contrast Edge, IEEE Trans. PAMI 7(2):402-409.

[Moravec 81] Moravec, H.P., Robot Rover Visual Navigation,
UMI Research Press, Ann Arbor, Michigan, 1981.

[Nagel 83] Nagel, H.-H., Displacement Vectors Derived
from Second Order Intensity Variations in Image
Sequences, CVGIP 21:85-117, 1983.

[Quam 84] Quam, L.H., Hierarchical Warp Stereo, Proc.
IUW (DARPA), pp. 149-155, October 1984.

[Schunck 83] Schunck, B.G., Motion Segmentation and
Estimation, Ph.D. Thesis, MIT, Dept. of EE & CS, 1983.

[Tanimoto & Pavlidis 75] Tanimoto, S., and Pavlidis, T.,
A Hierarchical Data Structure for Picture Processing,
CGIP 4(2):104-119, 1975.

[Wong & Hall 78] Wong, R.Y. and Hall, E.L., Sequential
Hierarchical Scene Matching, IEEE Trans. Comp.
27(4):359-366, 1978.

[Yachida 81] Yachida, M., Determining Velocity Map by 3-D
Iterative Estimates, Proc. 7th IJCAI, pp. 716-718,
Vancouver, B.C., Canada, 1981.




(Row disparity bins run down the side; column disparity bins run across the top.)

Level 4
          0      1
 -1 |    81     92
  0 |    10     11

Level 5
          0      1      2
 -2 |     9    365    326
 -1 |     2    166     24
  0 |     0      0      0

Level 6
          0      1      2      3      4
 -4 |     0      2      4      2      0
 -3 |     1     19     68   1576   1561
 -2 |     1     21     18    197     35
 -1 |     0      4      3      3      1
  0 |     0      0      0      0      0

Level 7
          0      1      2      3      4      5      6      7      8
 -8 |     0      0      1      2      0      0      0      0      0
 -7 |     0      1      4      2      4      3      0      0      0
 -6 |     0      8     19     16     27     74     62      7      2
 -5 |     2      5     27     60     90    248   1263  11195      4
 -4 |     3     24     37     26     65    118    206     95      6
 -3 |     3     15     29     19     10     36     88     40      1
 -2 |     0      1     13      5      8     17      8      6      0
 -1 |     0      0      3      0      1      1      0      0      0
  0 |     0      0      0      0      0      0      0      0      0

Figure 12: Hierarchical method: disparity histograms

These  are  two-dimensional  histograms  of  the  disparity 
fields  at  levels  4  through  7  for  the  multilevel  experiment. 
At  each  level,  the  pixel  location  nearest  the  expected 
disparity  level  is  surrounded  by  a  box.  (At  level  6  four 
pixels  are  equidistant  from  the  expected  disparity,  so  the 
box  surrounds  all  four.) 




Figure  13:  Hierarchical  method:  disparity  statistics 

These statistics are computed over those pixels at which
there were no error conditions. This included 194, 882,
3513, and 14007 pixels at levels 4 through 7 respectively
(out of a possible 256, 1024, 4096, and 16384). These are
the unmarked pixels in Figure 11.


SHAPE  RECONSTRUCTION 
ON  A  VARYING  MESH 


Isaac  Weiss 

Center  for  Automation  Research 
University  of  Maryland,  College  Park,  MD  20742 


ABSTRACT 

A central class of IU problems is concerned with
reconstructing a shape from an incomplete data set, such as
fitting a surface to (partially) given contours. We present
a new theory for solving such problems. Unlike the current
heuristic methods, we start from fundamental principles
that should be followed by any reconstruction method,
regardless of its mathematical or physical implementation.
We then present a mathematical procedure which conforms
to these principles. One major advantage of the method
is the ability to handle shapes containing variations of
different scales. A sharp variation, such as a corner, requires
a high resolution mesh for adequate representation, while a
slowly varying section can be represented with sparser mesh
points. Unlike current methods, our procedure fits the
surface on a varying mesh. The mesh is constructed
automatically to be denser at parts of the image having more
rapid variation. Analytical examples are given in simple
cases, followed by numerical experiments.

I.  INTRODUCTION 

A central problem of Image Understanding is the
inference of a 3-D shape from given 2-D images. According
to the types of information that can be extracted from the
images, theories have developed which can be classified
under categories such as Shape from Stereo, Shape from
Motion, Shape from Contour, Shape from Shading, Shape
from Texture, etc. A reliable shape reconstruction mechanism
will, apparently, have to combine several of these kinds
of clues in a consistent way. The present paper addresses
mainly the Shape from Contour component of the problem.
However, our theory is based on a general extremum
principle which can be used to incorporate most of the other
components of surface reconstruction as well. Related
topics, such as smoothing and interpolation, are also treated
in this framework.

The Shape from Contour problem can be stated as follows:
Given a set of contours, such as may be provided
by some image processing device, one wants to infer the
shape of the surface that is most likely described by these
contours. One can also consider an inverse, or a
complementary problem: given the surface, one seeks its “natural
parametrization”, namely contours that will convey its
essential characteristics in the most economical, yet reliable
way. One known mechanism that performs these tasks well
is the human visual system. A few drawn lines can create a
surprisingly vivid and convincing impression of a 3-D shape
[Marr, 1982; Barrow & Tenenbaum, 1981].

In  this  paper,  we  address  both  sides  of  this  problem, 
and thus we shall refer to it as “Shape Representation by
Contours”, which includes shape from contour as well as
contour  from  shape.  The  treatment  is  on  the  fundamen¬ 
tal  and  general  level  of  first,  finding  the  principles  that 
must  guide  any  process  of  representing  a  surface  by  con¬ 
tours,  regardless  of  its  physical  implementation,  and  sec¬ 
ond,  proposing  a  mathematical  mechanism  that  can  per¬ 
form  these  tasks,  conforming  with  our  principles.  We  as¬ 
sert  that  any  good  visual  system,  when  viewed  as  a  “black 
box”,  should  behave  in  a  way  similar  to  our  mathematical 
procedure.  We  do  not  attempt  to  specifically  emulate  the 
human  visual  system.  Nevertheless,  as  the  eye  is  the  most 
successful  image  processing  system  we  can  use,  we  shall  test 
the  performance  of  our  abstract  procedure  against  its  per¬ 
formance.  In  the  following  we  shall  briefly  review  some  of 
the  prior  work  and  then  summarize  the  requirements  that 
we  impose  on  a  surface  reconstruction  mechanism. 

Much of the previous work has been concerned with
various subsets of the general problem. One such subset
is the interpolation of a 2-D curve, given points in a 2-D
plane. The points are assumed to lie in the vicinity of the
curve, but not necessarily on it. The common approach to
the problem is through an extremum principle. In the
simplest case, one assumes that the curve is a straight line,
and performs a least squares fit. For general curves, one
possibility [Horn, 1977] is to assume that the interpolating
curve has the minimum overall curvature, subject to the
given data constraints. For instance, given the coordinates
of two end-points of a curve and its tangents there, one can
find the curve $\mathbf{t}(s)$, where $\mathbf{t}$ is the tangent and $s$ is the
arclength, that minimizes the functional
$E = \int (d\mathbf{t}/ds)^2 \, ds$, over
the domain of functions that satisfy the given boundary
conditions. (This derivative of the tangent is the curvature
$\kappa$.) By a physics analogy, this is also known as the
minimum energy principle. If the curve of least energy is nearly
straight, it can be approximated by cubic spline polynomials
spanned between some set of mesh points.
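The step from the least-energy curve to cubic polynomials can be made explicit. The following is a sketch under the small-slope assumption only, writing the curve as a graph $y(x)$, a notation introduced here for illustration:

```latex
% Nearly straight curve written as y(x) with |y'| << 1 (assumption):
%   ds ~ dx   and   kappa ~ y''(x)
E = \int \kappa^2 \, ds \;\approx\; \int_{x_0}^{x_1} \bigl(y''(x)\bigr)^2 \, dx .
% Euler-Lagrange equation for the integrand F = (y'')^2:
\frac{d^2}{dx^2}\,\frac{\partial F}{\partial y''} = 2\,y''''(x) = 0
\quad\Longrightarrow\quad
y(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 .
```

The four constants are fixed by the positions and tangents at the two end-points, which is why a cubic suffices between each pair of mesh points in this approximation.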




Smoothing is a problem closely related to interpolation
and reconstruction. We can regard the noisy curve (or
surface) as initial data from which a smooth curve is to
be constructed. Thus a good reconstruction technique can
also be used as a smoothing method, and as such it may
have a very wide applicability in computer vision, beyond
shape reconstruction. The converse is not true, however.
A reconstruction method must be able to deal with sparse
data input, e.g. to bridge gaps between known curves or
points, or build surfaces from sparse contours. These
contours do not need to be smooth and may contain corners
or bumps. In short, we regard interpolation and smoothing
as overlapping subsets of the reconstruction problem.

A problem shared by all known smoothing techniques,
and by the above minimum curvature principle in any of
its applications, is the handling of sharp variations in the
curve such as corners or bumps. A technique that efficiently
removes small perturbations like noise also blunts
corners which may be an essential part of the shape. A
general framework for smoothing a 3-D surface is given in
[Rosenfeld and Kak, 1982]. It consists of minimizing a cost
function $\phi$, which can be a weighted integral over the
picture of $(\partial f/\partial x)^2 + (\partial f/\partial y)^2 + (f - g)^2$, where $f$ is a possible
smoothing of a noisy picture $g$. Intuitively, this generalizes
the idea that Grimson [1981] called informally “no news is
good news”, namely that the overall change in the surface,
i.e. its derivatives, or some curvature, is minimal unless
we have good evidence to the contrary. Of course, the
exact form of $\phi$ has to be given, and it involves a trade-off
between noise removal and corner preservation.
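The trade-off can be seen in a small numerical sketch. The code below is an illustration under stated assumptions and not a method from the cited works: it discretizes the cost with forward differences, uses a single hypothetical `weight` on the derivative terms, and minimizes by plain gradient descent with replicated borders.

```python
import numpy as np

def smoothing_cost(f, g, weight=1.0):
    """Discrete cost: weight * sum(f_x^2 + f_y^2) + sum((f - g)^2)."""
    fx = np.diff(f, axis=1)
    fy = np.diff(f, axis=0)
    return weight * (np.sum(fx ** 2) + np.sum(fy ** 2)) + np.sum((f - g) ** 2)

def smooth(g, weight=1.0, steps=300, rate=0.05):
    """Plain gradient descent on smoothing_cost, starting from f = g."""
    f = g.astype(float).copy()
    for _ in range(steps):
        p = np.pad(f, 1, mode="edge")  # replicate borders (zero normal derivative)
        laplacian = (p[:-2, 1:-1] + p[2:, 1:-1] +
                     p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * f)
        gradient = 2.0 * (f - g) - 2.0 * weight * laplacian
        f -= rate * gradient
    return f

# A noise-free vertical step edge: the sharp feature that smoothing blunts.
g = np.zeros((16, 16))
g[:, 8:] = 1.0
f = smooth(g)
edge_jump = np.abs(np.diff(f, axis=1)).max()
```

The cost decreases, but the unit jump in `g` is spread over several pixels in `f`. A larger `weight` removes more noise and blunts the edge further, which is the trade-off described above.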

The smoothing technique derived from our general
reconstruction formalism is a departure from the above
formalism. By reparametrization of the surface we will be able
to alleviate or eliminate this need for a trade-off of noise vs.
blunted corners and deal with curves with sharp turns.

Some ad hoc variations of the minimum curvature
principle have been applied to 3-D problems. Barrow and
Tenenbaum add torsion to the functional $E$ containing the
curvature, but the torsion, being a higher derivative,
aggravates the sensitivity to noise. Terzopoulos [1982], [1985]
regards a surface as a thin flexible plate under tension,
attached to data points by “springs”. He minimizes the
energy of such a system, which is a functional of the surface
and its derivatives. This amounts to choosing a simple form
of $\phi$, and adding first derivatives of $f$, with parameters
reflecting the above trade-off. These parameters need to be
adjusted empirically for each application, and the restriction
to a thin plate (one that is not too different from a
plane) limits the model’s usefulness.

Another subset of the 3-D reconstruction problem to
which the minimum curvature principle has been applied
is (planar) “deprojection”: given a curve on a plane which
is slanted and tilted relative to the observer’s line of
view, estimate the most likely slant and tilt. (The observer
can only see a projected image of the curve.) Humans, for
example, when asked to interpret an apparent ellipse as a
slanted shape, will regard it as a slanted circle. On this
assumption the slant and tilt can be calculated. According
to the minimum curvature principle, the slant and tilt are
those which minimize the overall curvature of the
“deprojected” curve.

Several  problems  arise  in  these  applications.  First,  a 
very  high  penalty  is  incurred  by  corners  and  bumps  on  the 
curve.  Since  the  curvature  of  a  sharp  corner  tends  to  in¬ 
finity,  a  corner  will  tend  to  dominate  the  curvature  inte¬ 
gral.  Straight  lines,  having  zero  curvature,  will  be  ignored, 
regardless  of  their  length.  Thus  the  minimum  curvature 
principle  will  be  unable  to  handle  curves  with  corners  or 
even  small,  sharp  bumps,  that  may  be  caused  by  noise, 
and  long  straight  lines.  Second,  the  curvature  measure  is 
not  dimensionless.  The  interpolated,  or  deprojected  curve 
will  depend  on  the  size  of  the  image.  Third,  in  the  case  of 
the  slant  finding  problem,  the  deprojected  curve  is  not  in 
agreement  with  human  perception.  The  deprojection  of  an 
ellipse  will  not  be  a  circle. 
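The first two problems can be checked with a short worked example. This is a sketch that models a rounded corner as a circular arc of radius $r$ turning through an angle $\alpha$; the arc model and the symbols $r$, $\alpha$ and $\lambda$ are assumptions introduced here for illustration:

```latex
% Rounded corner: circular arc of radius r turning through angle alpha,
% so kappa = 1/r along an arc of length r*alpha.
\int_{\text{corner}} \kappa^2 \, ds = \frac{1}{r^2}\, r\alpha = \frac{\alpha}{r}
\;\longrightarrow\; \infty \quad (r \to 0),
\qquad
\int_{\text{straight}} \kappa^2 \, ds = 0 .
% Scaling the whole curve by lambda: kappa -> kappa/lambda, ds -> lambda ds,
E \;\longrightarrow\; \int \frac{\kappa^2}{\lambda^2}\, \lambda \, ds = \frac{E}{\lambda} .
```

A sharper corner therefore contributes an arbitrarily large amount, a straight segment contributes nothing whatever its length, and the value of the integral changes with the size of the image.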

A mechanism which partly solves the problems of the
minimum curvature principle is the Brady-Yuille [1984]
maximum compactness measure. Applied to the slant finding
problem of a closed planar curve, it states that the
deprojected curve is the one for which the ratio of the area
enclosed by the curve to its perimeter squared is maximal.
It is related to the statistically-based Maximum Likelihood
Method of Davis, Janos and Dunn [1982] and Witkin [1981].
These measures are inherently insensitive to noise, are
dimensionless and give the expected intuitive results. These
authors did obtain the circle as the deprojected ellipse, and
they also calculated the resulting slant and tilt. It seems
hard, however, to apply these measures to more general
curves.
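The compactness measure is easy to try on the ellipse example. The sketch below is an illustration, not the cited authors' procedure: it assumes orthographic projection with the tilt along the minor axis, so that a slant σ stretches the apparent minor semi-axis by 1/cos σ, and it finds the best slant by a simple grid search. All function names are hypothetical.

```python
import numpy as np

def ellipse_perimeter(a, b, samples=4096):
    """Numerical perimeter of an ellipse with semi-axes a and b."""
    theta = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    speed = np.sqrt((a * np.sin(theta)) ** 2 + (b * np.cos(theta)) ** 2)
    return 2.0 * np.pi * speed.mean()

def compactness(a, b):
    """Area divided by perimeter squared (dimensionless)."""
    return np.pi * a * b / ellipse_perimeter(a, b) ** 2

def slant_of_max_compactness(a, b_apparent):
    """Grid search over slant (degrees) for the most compact deprojection."""
    slants = np.radians(np.arange(0.0, 80.0, 0.1))
    deprojected_b = b_apparent / np.cos(slants)   # undo the foreshortening
    scores = np.array([compactness(a, b) for b in deprojected_b])
    return np.degrees(slants[np.argmax(scores)])

best_slant = slant_of_max_compactness(a=1.0, b_apparent=0.5)
```

For an apparent ellipse with axis ratio 1:2 the search settles at a slant of about 60 degrees, where the deprojected curve is a circle and the compactness reaches its maximum value of 1/(4π). Scaling both axes by the same factor leaves the measure unchanged.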

The experience with these investigations over the past
few years enables us to formulate some requirements that
a surface reconstruction mechanism should satisfy, and it
shows the need for a new mechanism that will meet them.
To meet these demands, our proposed formalism had to
depart from the conventional form of $\phi$ above, by redefining
length and curvature and by introducing a reparametrization
of curves and surfaces. We now list these requirements
and touch on the way our formalism deals with them.

1) Surface consistency: A surface reconstruction mechanism
will build a surface that does not change very much,
unless this mechanism has information indicating such a
change, such as boundaries or other contours. It should not
add extraneous information of its own. We shall define a
functional of the surface, related to the curvature, that the
desired surface should minimize.

2)  Translational  and  rotational  invariance — i.e.  if  the 
input  image  is  rotated  or  translated,  the  output  should 
move  by  the  same  amount  but  not  change  its  shape. 

3) Dimensionlessness, or scale invariance: When the
distance between an object and a viewer changes, the
object’s apparent size changes, but not its shape. Therefore,
a reconstruction mechanism should yield the same output
shape from an input image, regardless of its size. Thus the
mathematical method of reconstruction must be dimensionless,
or invariant to scale. Partly to solve this problem, we
have introduced a renormalization of the arclength.

4) Handling of different variation scales: As mentioned
above, the mechanism has to handle rapid curvature
changes, such as sharp corners, without blurring them too
much, while also taking into account stretches of slow
variation, such as straight or slightly curved parts of the shape.
All the schemes based on simple derivatives of the surface
have severe difficulties in this respect. A reparametrization
of the curves, derived from the extremum principle itself,
helps us to solve this problem.

5)  The  mechanism  should  have  a  wide  domain  of  ap¬ 
plicability,  including  closed  and  open  space  curves  and  sur¬ 
faces,  with  few  or  no  restrictions. 

The  following  requirements  do  not  seem  to  be  essential 
but  have  various  degrees  of  desirability: 

6) Computational efficiency: This is, of course, a function
not only of the mathematical scheme, but also of the
specific implementation technique and the hardware
architecture, which are issues by themselves. We can generally
say that since the bulk of the computation in our theory
is done on local variables, the scheme lends itself to
implementation on parallel computers. As the general theory
is non-linear, some slowdown is expected compared to the
special cases in which it can be linearized, such as “thin
plates”.

7) One would prefer the same mechanism to handle
both aspects of the problem mentioned above, namely finding
both the surface and a suitable set of coordinates on it.
While previously suggested mechanisms only attempt to
find the surface, ours also leads to a “natural”
parametrization of it.

8)  Consistency  with  human  perception:  For  special 
cases  we  can  obtain  analytic  results  and  compare  them  with 
human  perception,  with  good  agreement.  For  example,  a 
circle  which  is  known  to  be  an  occluding  boundary  will  give 
rise  to  a  sphere. 

The  first  part  of  the  paper  describes  the  theory.  We 
start  with  curves  in  the  2-D  plane,  and  continue  with  gen¬ 
eral  surfaces  in  3-D.  The  second  part  describes  numerical 
experiments. 

II.  SMOOTH  CONTOURS 

In this section we shall deal with the particular case in
which a simple minimum curvature principle gives at least
mathematically definite and stable results, namely smooth
curves, without noise or corners. We shall modify it in a
way that will make it dimensionless, and will also make it
amenable to treatment by a standard analytic minimization
technique. We shall use this technique to find the equation
for the curve of minimal modified curvature that passes
between two end-points. We shall also obtain, as a byproduct,
the circle as the “deprojection” of the ellipse. We will
preserve the useful properties of the energy principle.

As  curvature  is  a  good  measure  of  surface  change,  the 
mechanism  will  seek  a  surface  which  has  the  minimal  over¬ 
all  (modified)  curvature,  while  being  consistent  with  the 
available information. It is also qualitatively clear that information
is closely related to the smoothness of the curve
or  the  surface.  A  straight  line  can  be  described  by  the  co¬ 
ordinates  of  the  end-points  only,  while  a  more  complicated 
shape  will  require  more  information.  Therefore,  minimizing 
curvature  is  in  accordance  with  minimizing  the  extraneous 
information  generated  by  the  mechanism  itself. 

As one wants to represent the amount of information
contained in a curve by its curvature or a related quantity,
the natural mathematical quantities to deal with are
the derivatives of the tangent vector, such as the curvature
vector $\boldsymbol{\kappa} = d\mathbf{t}/ds$. As we demand rotational invariance, we
have to use a scalar product of these vectors. This leads to
the suggestion of the principle of minimum “spline bending
energy” [Horn, 1981], by which one wants to minimize the
integral

$$E_{spline} = \int \kappa^2 \, ds = \int \frac{d\mathbf{t}}{ds} \cdot \frac{d\mathbf{t}}{ds} \, ds$$


taken  over  the  whole  curve.  This  expression  approximates 
the  bending  energy  of  a  flexible  stick,  such  as  a  “spline” 
used  by  draftsmen  to  draw  a  curve  between  points. 

This  measure  has  been  implemented  in  several  ways. 
First,  in  an  interpolation  problem:  given  the  coordinates 
and  tangents  of  two  points  of  the  curve,  one  can  find  the 
curve  that  will  pass  through  them  and  will  minimize  this 
integral.  Second,  in  a  deprojection  problem:  given  a  closed 
boundary  such  as  an  ellipse,  one  can  try  to  interpret  it 
as  a  projected  image  of  a  surface  in  3-D.  The  human  eye 
will  usually  regard  an  ellipse  as  a  slanted  circle,  and  we 
would  like  an  extremum  principle  to  yield  the  circle  as  an 
extremum  over  the  set  of  all  curves  compatible  with  the 
projected  ellipse.  This  would  be  consistent  with  the  result 
obtained  by  the  Maximum  Likelihood  method.  Limiting 
ourselves  to  planar  curves  (zero  torsion),  this  compatible 
set  is  the  set  of  ellipses  (produced  by  different  slant  angles), 
with  the  same  major  axis  as  the  apparent  one. 

The  minimum  curvature  principle  fails  in  this  task  as 
it  does  not  lead  to  the  circle  as  the  extremum  over  the 
set  of  ellipses.  Even  worse,  as  it  is  not  dimensionless,  the 
extremum  will  depend  on  the  apparent  size  of  the  image, 
contrary to our demands.

We now propose a modification of the minimum curvature
principle, which will work for smooth curves. Letting
$L$ be the total length of the curve, we can define the
dimensionless variable

$$l = s/L$$

where $s$ is the length of the curve lying between a point
on the curve and one of its ends. $l$ is a measure of the
relative position of a point on the curve, regardless of the
curve’s length. We define the “normalized” energy $E_n$ as
the integral


$$E_n = \int_0^1 \frac{dt}{dl} \cdot \frac{dt}{dl} \, dl$$


The difference between $E_{spline}$ and the normalized energy
$E_n$ is the replacement of $ds$ by $dl$. This is equivalent to
multiplying the “spline energy” by $L$. As both $l$ and $t$ are
dimensionless, so is $E_n$.
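
The scale independence of $E_n$ can be illustrated numerically. The sketch below is not part of the original work: it assumes a circular arc sampled at equal arclength steps, and the function name `arc_energies` and the sample radii are illustrative choices. It evaluates the spline energy (the sum of $\kappa^2 ds$) and the normalized energy ($L$ times the spline energy) for arcs of the same total turn but different radii.

```python
import math

def arc_energies(radius, turn, n=2000):
    """Discretize a circular arc and return (E_spline, E_n).

    E_spline is the sum of kappa^2 * ds over the arc; E_n is L * E_spline,
    where L is the total arc length.
    """
    length = radius * turn
    ds = length / n
    # Tangent angle at the n+1 sample points; it grows linearly along an arc.
    phi = [turn * i / n for i in range(n + 1)]
    # Finite-difference curvature on each of the n intervals.
    kappa = [(phi[i + 1] - phi[i]) / ds for i in range(n)]
    e_spline = sum(k * k * ds for k in kappa)
    return e_spline, length * e_spline

for r in (0.5, 1.0, 10.0):
    e_spline, e_n = arc_energies(r, math.pi)
    print(r, e_spline, e_n)
```

Under these assumptions the spline energy changes with the radius, while the normalized energy stays equal to the square of the total turn.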




Now, in accordance with our modified minimal curvature
requirement, we shall look for a curve, $t(l)$, that will
extremize the normalized energy $E_n$ over the set of all
possible curves having the same $t$ at the end-points.

An important feature of the normalized arclength $l$ is
that its values at the end-points are fixed, in this case to 0
and 1. This is in contrast to the original arclength $s$, whose
value at one end is the length of the curve, and it is more
in the character of an ordinary coordinate such as $x$ or $y$.
This makes it possible to use $l$ as the independent variable
in many mathematical techniques for which $s$ is not suited.
In particular, we are interested now in techniques from the
calculus of variations for the minimization of the energy
integral. Other works have also used these techniques, but
only after assuming that the curve (or surface) is nearly
straight, so that $s$ could be approximated by $x$.

We will deal in this paper with functionals $E$ which depend
on several functions $\psi_i(x_k)$, such as curve or surface
tangents, and their derivatives. Here $x_k = x_1 \ldots x_n$, where
$n$ is the number of independent variables. A necessary condition
for $\psi_i$ to minimize the functional $E$ (over some domain
of admissible functions) is that for all infinitesimal
variations $\delta\psi_i$, the infinitesimal variation of the functional
$\delta E$ vanishes. Namely, for the given function $F$ of $\psi_i$ and
their derivatives $\partial\psi_i/\partial x_k$, and for all $\delta\psi_i$, one must
have

$$\delta E = \delta \int F\left(\psi_i, \frac{\partial\psi_i}{\partial x_k}\right) dx^n = 0$$

where $dx^n$ is the volume element in the n-D space.

If we restrict ourselves to the domain of (sufficiently
differentiable) functions with fixed boundaries, such as
curves with known coordinates and tangents at their ends,
then the above equation will be satisfied if, for each $i$, $F$
satisfies the Euler-Lagrange equation [Courant and Hilbert,
1953]:

$$\frac{\partial F}{\partial\psi_i} - \sum_k \frac{\partial}{\partial x_k}\left(\frac{\partial F}{\partial(\partial\psi_i/\partial x_k)}\right) = 0$$

In our normalized energy functional $E_n$, the function $F$
is a simple quadratic, the independent variable is $l$, and the
unknown function variables $\psi_i$ are the components of $t$, $t_x$
and $t_y$. These unknowns are not independent, though. As
$t = dx/ds$ we have $t_x^2 + t_y^2 = 1$. By using a polar coordinate
system, with the angle $\phi$ defined as the angle between the
tangent $t$ and (say) the $x$ axis, we will obtain an equation
of a single function variable $\phi$. In this system we have

$$t_x = \cos\phi$$
$$t_y = \sin\phi$$

The energy now reads

$$E_n = \int_0^1 \left(\frac{d\phi}{dl}\right)^2 dl$$

In the case that the only information we have about
the end-points is the inclination (tangent angle) of the curve
there, $\phi_1$ and $\phi_2$, this last expression suffices. To minimize
$E_n$, we apply the Euler-Lagrange equation to it and obtain

$$\frac{d^2\phi}{dl^2} = 0$$

with the unique solution

$$\phi = (\phi_2 - \phi_1)\, l + \phi_1$$

which is a circular arc. Thus, the curve which minimizes $E_n$
over the domain of functions having given tangent angles
at the ends, is a circular arc.
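
As a rough numerical check of this result, one can compare the energy of the linear tangent-angle profile with that of profiles having the same end angles plus a ripple. The sketch below is only an illustration: it takes the normalized energy to be the integral of $(d\phi/dl)^2$ over $l$ from 0 to 1, and the end angles `PHI1`, `PHI2`, the sinusoidal ripple family and all function names are assumptions made for the example.

```python
import math

def normalized_energy(phi, n=4000):
    """Approximate the integral over [0, 1] of (dphi/dl)^2 by finite differences."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        slope = (phi((i + 1) * h) - phi(i * h)) / h
        total += slope * slope * h
    return total

PHI1, PHI2 = 0.3, 1.7    # illustrative end tangent angles (radians)

def arc(l):
    """Linear tangent angle, i.e. a circular arc."""
    return (PHI2 - PHI1) * l + PHI1

def perturbed(eps, k):
    """Same end angles, plus a ripple that vanishes at l = 0 and l = 1."""
    return lambda l: arc(l) + eps * math.sin(k * math.pi * l)

e_arc = normalized_energy(arc)
e_ripples = [normalized_energy(perturbed(eps, k))
             for eps in (0.05, 0.2) for k in (1, 2, 5)]
print(e_arc, min(e_ripples))
```

Every rippled profile in this family has a higher energy than the arc, whose energy equals the square of the total turn.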

In particular, when $\phi_1 = -\phi_2 = \pi/2$, we obtain two
semi-circles (upper and lower). The problem of deprojecting
an ellipse is now solved trivially. Since the circle minimizes
$E_n$ over all functions with end-points perpendicular
to the $x$ axis, it will also do so over the sub-domain of
ellipses with an axis along $x$. This last particular result can
be obtained by much more elementary means, because one
only needs to find the value of one parameter (the ellipse’s
eccentricity, related to the slant). The general interpolation
problem, however, requires the full power of the calculus of
variations as we are looking for a function that connects the
two points.
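
The one-parameter version of the deprojection problem is easy to try numerically. The following sketch is an illustration under stated assumptions, not the author's implementation: candidate ellipses share one semi-axis `a`, the other semi-axis is scanned over a hypothetical grid, the spline energy is the integral of $\kappa^2 ds$ and the normalized energy is $L$ times that integral.

```python
import math

def ellipse_energies(a, b, n=4000):
    """Return (E_spline, E_n) for the ellipse x = a cos(theta), y = b sin(theta)."""
    dtheta = 2.0 * math.pi / n
    length = 0.0
    bend = 0.0
    for i in range(n):
        theta = (i + 0.5) * dtheta
        w = (a * math.sin(theta)) ** 2 + (b * math.cos(theta)) ** 2  # (ds/dtheta)^2
        length += math.sqrt(w) * dtheta              # ds
        bend += (a * b) ** 2 / w ** 2.5 * dtheta     # kappa^2 ds
    return bend, length * bend

a = 1.0                                           # semi-axis shared by all candidates
candidates = [j / 20.0 for j in range(10, 41)]    # other semi-axis: 0.5 ... 2.0
best_n = min(candidates, key=lambda b: ellipse_energies(a, b)[1])
best_spline = min(candidates, key=lambda b: ellipse_energies(a, b)[0])
print(best_n, best_spline)
```

On this grid the normalized energy selects the circle (`b == a`), whereas the unnormalized spline energy selects a different, more elongated ellipse, in line with the failure of the simple minimum curvature principle noted earlier.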

In most applications one has more input data than the
end tangents. One can include the spatial coordinates of
the end-points as constraints, and show that we still obtain
a circular arc [Weiss, 1986a]. Other input information can
be added as an attraction potential, as described below.

III. LARGE- AND SMALL-SCALE VARIATIONS

The  simple  minimum  curvature  principle,  our  modifi¬ 
cation  notwithstanding,  suffers  from  a  severe  drawback:  It 
cannot handle sharp turns in the curve. A sharp corner will
completely  dominate  the  curve,  and  the  value  of  the  integral 
will  depend  on  the  exact  shape  of  the  corner,  with  point-like 
edges  leading  to  infinities.  Even  a  small  bump  will  have  a 
considerable  influence,  and  in  fact,  the  smaller  the  bump, 
the  greater  its  effect  on  the  integral  will  be  (keeping  its 
shape  similar).  This  makes  En  inapplicable  for  curves  with 
edges,  or  noise,  without  elaborate  filtering  techniques.  On 
the  other  hand,  straight  lines  are  ignored  by  this  measure, 
regardless  of  how  long  they  may  be. 

We propose a new kind of an extremum principle, one
of whose advantages is essentially solving this small (and
large) scale variation problem. (Another advantage will become
apparent when we go over to 3-D.) Our theory is
based on minimizing a functional of the curve, which we
call “spring energy”, as it is analogous in many respects to
the energy of a physical spring. In addition to the ability
to bend, as in a conventional spline model, the spring has
another degree of freedom. It can stretch or shrink in some
uneven manner, creating a varying density of coils along
it. The curve can now be reparametrized by a variable $\alpha$,
which represents the density of coils along the spring. In
a discrete representation of the curve, this corresponds to
the density of the knot points.

In  the  following  we  shall  (usually)  adhere  to  the  follow¬ 
ing  terminology:  Knots,  or  knot  points  are  a  finite  number 
of  points  on  the  curve  itself,  which  represent  it.  All  other 




curve  points  can  be  calculated  from  these.  Data  points  are 
given  by  a  sensor  of  some  kind.  Grid  points  are  a  fixed 
network  of  points  in  n-D  space,  independent  of  the  curve 
or  the  data.  “Mesh  points”  is  a  general  term  used  in  place 
of  either  knots  or  grid  points. 

The  spring’s  energy  consists  of  three  parts:  (a)  Bend¬ 
ing  energy,  which  is  related  to  the  curvature  of  the  curve. 
A rapidly varying curve bends more than a slowly varying
one, and thus has a higher bending energy. This part is
similar to the (normalized) spline energy discussed before.

(b)  Stretching  energy.  This  component  represents  the  ad¬ 
ditional  degree  of  freedom  that  a  spring  has.  When  one 
stretches  a  spring  by  holding  it  at  both  ends,  the  spring 
will  stretch  uniformly,  as  this  is  the  state  of  minimum  en¬ 
ergy  under  these  conditions.  Any  deviation  from  uniformity 
will  increase  this  stretching  energy.  It  is  this  deviation  from 
uniformity  that  we  are  interested  in,  because  it  can  be  re¬ 
lated  to  a  nonuniform  placement  of  the  knot  points.  We 
can  associate  the  density  of  the  knot  points  with  the  den¬ 
sity  of  the  spring  coils,  so  that  a  high  concentration  of  coils 
in  one  section  means  a  larger  number  of  knot  points  there. 

(c)  Energy  resulting  from  external  forces  operating  on  the 
spring,  forcing  it  to  deviate  from  its  natural  form.  We  will 
assume  that  each  data  point  is  a  source  of  an  attraction 
force  operating  on  each  point  of  the  spring  representing 
the  curve,  with  the  force  being  stronger  on  curve  points 
that  get  closer  to  the  data  point.  In  minimizing  the  overall 
energy,  there  is  a  trade-off  between  its  three  components, 
and  it  turns  out  that  a  local  trade-off  between  the  bending 
and  stretching  energies  creates  a  desirable  placement  of  the 
knots.  This  will  be  shown  analytically  in  simple  cases  and 
will  be  demonstrated  numerically. 

Intuitively,  one  can  regard  the  attraction  energy  (c)  as 
the  energy  that  forces  the  spring  to  follow  the  data  points. 
There  will  be  bending  of  the  spring  in  places  where  the 
data  points  indicate  corners.  Because  the  bending  tends  to 
distribute  itself  uniformly  over  all  coils,  the  coils  will  tend 
to  concentrate  in  the  regions  of  high  bending.  That  makes 
the  amount  of  bending  per  coil  more  uniform,  which  re¬ 
duces  the  total  bending  energy  (a).  The  stretching  energy 
(b),  on  the  other  hand,  will  increase  when  the  coils’  spa¬ 
tial  distribution  deviates  from  a  uniform  one,  and  thus  a 
balance  is  created  between  the  tendency  of  the  coils,  or  the 
knot  points,  to  concentrate  in  high  bending  sections,  and 
their  tendency  to  be  uniformly  distributed.  This  reflects  a 
balance  between  the  wish  to  follow  the  curve  as  closely  as 
possible,  and  the  wish  to  smooth  the  data.  Unlike  other 
schemes, however, this balancing is done locally at each
part  of  the  curve,  and  is  thus  much  more  responsive  to  the 
local  shape  of  the  curve,  than  in  schemes  which  try  to  trade 
off  between  smoothing  and  accurate  fitting  in  some  global 
way.  This  knot  placement  method  is  a  natural  part  of  the 
minimization  principle  and  the  placement  is  done  automat¬ 
ically  in  the  course  of  numerically  minimizing  the  energy 
functional. 

We turn now to the mathematical representation of
the model. Denoting by $E_s$ the total energy of the curve
(“spring”), by $t$ the tangent vector, by $s$ the arclength, and
by $\alpha$ an independent parameter along the curve, analogous
to the distribution of coils along the spring, the energy is

$$E_s = \int_{\alpha_0}^{\alpha_1} \left[ \frac{dt}{d\alpha} \cdot \frac{dt}{d\alpha} + k \left(\frac{dl}{d\alpha}\right)^2 + V \right] d\alpha, \qquad l = s/L$$

where $\alpha_0, \alpha_1$ are the values of $\alpha$ at the curve's ends, $k$ is a
constant factor related to the spring’s flexibility, $L$
(the total arclength divided by the total change in $\alpha$) is used
as a normalization factor, and $V = V(x)$ is the local value
of the external potential energy (described below).

The  first  term  above  is  the  bending  energy,  the  second 
term  is  the  stretching  energy  and  the  third  term  represents 
the  influence  of  the  external  forces  acting  on  the  spring. 
By definition, the force acting on a particle is the gradient
of (the negative of) the potential at that point. The
best fitting curve is now the one that minimizes $E_s$ over a
suitable space of admissible curves, subject to some boundary
conditions at the end-points. Given the potential $V(x)$
and the boundary conditions, one can find the functions
$t(x)$, $s(x)$, $\alpha(x)$ which minimize the functional $E_s$, namely
find both the curve and the parameter $\alpha$ along it.
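
To make the three contributions concrete, a discrete version of the energy can be written for a polyline of knots. The sketch below rests on assumptions of our own: the knots are equally spaced in the parameter (which runs from 0 to 1), the bending term is the squared turn per unit parameter, the stretching term is `k` times the squared normalized segment length per unit parameter, and the potential is sampled at the knots. The function name `spring_energy` and its arguments are hypothetical.

```python
import math

def spring_energy(knots, k=1.0, potential=None):
    """Discrete bending + stretching + potential energy of a polyline.

    knots: list of (x, y) points, assumed equally spaced in the parameter alpha.
    """
    n = len(knots) - 1                        # number of segments
    d_alpha = 1.0 / n                         # alpha runs from 0 to 1 in equal steps
    seg = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(knots, knots[1:])]
    lengths = [math.hypot(dx, dy) for dx, dy in seg]
    total = sum(lengths)
    # Stretching: k * (dl/dalpha)^2 * dalpha, with l the arclength divided by total.
    stretch = k * sum((ln / total) ** 2 / d_alpha for ln in lengths)
    # Bending: (dphi/dalpha)^2 * dalpha at each interior knot.
    angles = [math.atan2(dy, dx) for dx, dy in seg]
    bend = 0.0
    for a0, a1 in zip(angles, angles[1:]):
        turn = math.atan2(math.sin(a1 - a0), math.cos(a1 - a0))  # wrap to (-pi, pi]
        bend += turn ** 2 / d_alpha
    # Potential: crude quadrature, V sampled at the knots times d_alpha.
    pot = 0.0
    if potential is not None:
        pot = sum(potential(p) for p in knots) * d_alpha
    return bend + stretch + pot

uniform = [(i / 10.0, 0.0) for i in range(11)]
uneven = [(0.0, 0.0), (0.5, 0.0), (0.6, 0.0), (1.0, 0.0)]
print(spring_energy(uniform), spring_energy(uneven))
```

With these conventions a straight line with uniformly placed knots has energy 1, and any uneven placement of knots on the same line raises the stretching part.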

To gain more intuitive understanding of $\alpha$, we will ignore
the third term for the moment, and assume that the
curve $\phi(l)$ (or equivalently $t(l)$) is given. The only unknown
is its reparametrization $\alpha$. It can be seen that the minimization
creates two forces that influence the reparametrization,
resulting from the two first terms in $E_s$. The stretching,
represented by the second term, pushes $\alpha$ to follow $l$
as closely as possible (thus making the first term close to
the notion of curvature). The bending, on the other hand,
drives $\alpha$ to follow $\phi$, so that a large change in $\phi$, as in a
corner, will cause a correspondingly large change of $\alpha$ in
that region.

The spring is “ideal” in the sense that it, or part of it,
can be shrunk to an infinitesimal length (the coils have no
thickness). This is what will happen in a corner, as we will
see more precisely below. In this case, a finite bending $\Delta\phi$ is
concentrated in an infinitesimally small length of the curve.
To minimize the energy, a finite “number of coils” $\Delta\alpha$ will
concentrate in that corner, allowing the finite bending $\Delta\phi$
to spread over it, so that the bending energy, which depends
on the ratio $\Delta\phi/\Delta\alpha$, is not excessively large.

When the curve is not given, then $E_s$ should be minimized
over all functions $t(\alpha)$ and $l(\alpha)$, with all possible
parametrizations $\alpha$, which are consistent with the available
data.

The  analogy  to  a  spring  is  not  perfect,  though,  as  an 
actual  spring  usually  does  not  have  a  dimensionless  E. 

The normalization factor $L$ is important in two ways:
(i) It makes $E_s$, and hence the whole theory, dimensionless.
Thus the resulting curve will be independent of the scale of
the problem, which should not affect the curve fitting. This
is unlike most other theories. (ii) In placing the knots, we
are  only  interested  in  how  much  their  distribution  deviates 




from  a  uniform  one,  and  we  want  this  deviation  to  affect 
the  (stretching)  energy.  We  do  not  want  energy  resulting 
from  a  uniform  stretching  of  the  spring  to  be  counted.  The 
normalization  by  the  total  arclength  eliminates  this  contri¬ 
bution.  The  deprojection  problem  considered  earlier  can 
demonstrate  this  point.  In  that  case,  we  are  given  an  el¬ 
lipse  as  the  projected  curve,  and  we  want  to  find  the  space 
curve  that  most  likely  has  given  rise  to  this  apparent  el¬ 
lipse.  A  planarity  assumption  restricts  us  to  an  ellipse  with 
an unknown slant. Minimizing $E_s$, with the normalization
factor,  will  yield  a  circle  as  the  ellipse  which  minimizes  the 
energy  over  the  space  of  all  possible  deprojected  ellipses,  be¬ 
cause  it  has  the  most  uniform  distribution  of  curvature  (and 
of  coil  density).  Without  normalizing,  however,  the  circle 
is  not  the  minimizer  because  a  lower  energy  state  can  be 
obtained  by  shortening  the  total  arclength,  even  at  the  ex¬ 
pense  of  deviating  from  uniformity.  It  is  this  contribution 
to  the  energy  from  change  in  the  arclength  that  is  elimi¬ 
nated  by  normalizing  the  arclength.  The  circle  is  the  more 
desirable  result  as  it  is  consistent  with  the  other  theories 
mentioned  above,  as  well  as  with  human  perception. 

The potential $V(x)$ is determined as follows: Let us
assume that attached to each point $x_d$ of the given data is
an attractive potential $g = g(x - x_d)$, which represents the
force that this data point exerts on a curve point lying at
$x$. Then, the influence of the entire given picture at each
point $x$ is a convolution of $g$ with the picture $f$:

$$V(x) = \int f(x_d)\, g(x - x_d)\, dx_d$$

For  a  discrete  data  set  the  integral  is  of  course  replaced  by 
a  summation. 

In least squares fitting, $g$ is proportional to $r^2$ ($r$ being
the distance $|x - x_d|$). A better choice is a $g$ that decreases
with distance such as a Gaussian, a Gaussian divided by $r$,
or a Lorentzian function. The attraction force can be found
by differentiating $V$.
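
For a discrete data set the convolution reduces to a weighted sum over the data points. The sketch below is one possible realization under explicit assumptions: the kernel is taken as a negative Gaussian, so that the potential is lowest at the data and the force points toward them; the width `sigma`, the optional `weights` (standing in for the picture values), the finite-difference step `h` and the sample data are all hypothetical.

```python
import math

def potential(x, data, weights=None, sigma=1.0):
    """V(x) as a weighted sum of kernels, with g(r) = -exp(-r^2 / (2 sigma^2))."""
    if weights is None:
        weights = [1.0] * len(data)
    v = 0.0
    for (xd, yd), f in zip(data, weights):
        r2 = (x[0] - xd) ** 2 + (x[1] - yd) ** 2
        v -= f * math.exp(-r2 / (2.0 * sigma ** 2))
    return v

def force(x, data, weights=None, sigma=1.0, h=1e-6):
    """Attraction force, minus the gradient of V, by central differences."""
    fx = -(potential((x[0] + h, x[1]), data, weights, sigma)
           - potential((x[0] - h, x[1]), data, weights, sigma)) / (2.0 * h)
    fy = -(potential((x[0], x[1] + h), data, weights, sigma)
           - potential((x[0], x[1] - h), data, weights, sigma)) / (2.0 * h)
    return fx, fy

data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]   # made-up data points
print(potential((1.0, 0.2), data), force((1.0, 1.0), data))
```

With a kernel of this kind, distant data points contribute little to the force, unlike the least squares case in which the pull grows with distance.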

We will now examine the behavior of $E_s$ more precisely
in various cases where the extremization can be done
analytically, and show that it meets our demands.

For the interpolation problem, we have to consider
curves which satisfy given boundary conditions, and $V$ is
unnecessary. If we assume $k = 0$ in $E_s$ we obtain the case
discussed in the previous section. In the general case, we
can again use the EL equations. We will now have two EL
equations from which we can determine both these functions,
and thus find both the curve function $t(l)$ and the
new coordinate $\alpha(l)$. The two EL equations are very simple
and have the solution

$$l = \alpha$$

$$\phi = (\phi_2 - \phi_1)\, \alpha + \phi_1$$

Substituting $l$ for $\alpha$ in the second equation we immediately
obtain the circular arc we had before. Thus, our new
modifications at least did not destroy the desirable results we
have had before with $E_n$.

We shall first examine the behavior of $E_s$ in a simple
curve containing small and large scale variations, and
compare it to the behavior of the former energy $E_n$. This
curve is a simple isolated corner. In the next section we
shall use the results obtained here for handling a full shape
containing corners, namely a parallelogram.

Consider two straight segments of length $\Delta l$, joined by
a small circular arc of length $\Delta l'$, and making an angle $\Delta\phi$
with each other (Fig. 1). It is trivial to prove from the
EL equations that on a straight line $l$ must be proportional
to $\alpha$, and we have already proven that on a circular arc, $l$,
$\alpha$, $\phi$ are all proportional to each other. Thus, substituting
these results in the general expression for $E_s$, the integration
over the separate sections can be performed explicitly
and $E_s$ can be expressed by the finite increments in $l$, $\phi$, $\alpha$,
as follows:

$$E_s = 2\frac{\Delta l^2}{\Delta\alpha} + \frac{\Delta l'^2}{\Delta\alpha'} + \frac{\Delta\phi^2}{\Delta\alpha'}$$

where $\Delta\alpha$, $\Delta\alpha'$ are the increments of $\alpha$ over $\Delta l$, $\Delta l'$
respectively. We want to extremize $E_s$ with respect to these
increments, subject to the constraint

$$2\Delta\alpha + \Delta\alpha' = 1$$

An elementary calculation shows that the extremum is
obtained with

$$\Delta\alpha = \frac{\Delta l}{2\Delta l + R}, \qquad \Delta\alpha' = \frac{R}{2\Delta l + R}$$

where $R = \sqrt{\Delta l'^2 + \Delta\phi^2}$.

Substituting these quantities in $E_s$ we obtain in the
extremum

$$E_s = (2\Delta l + R)^2$$

For a small corner, $\Delta l' \ll \Delta\phi$ and $\Delta l \approx 1/2$. We thus find
the simple result

$$E_s \approx (1 + \Delta\phi)^2$$

This expression is independent of the exact shape and size
of the corner, and depends only on the total turn $\Delta\phi$. The
corner region $\Delta l'$ can be as small as we like without
adversely affecting the energy $E_s$.
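
The closed-form extremum for the isolated corner can be checked by a direct one-dimensional minimization. The sketch below is illustrative only: it takes the corner energy as the sum of two straight-segment terms and one arc term, each of the form (increment squared) over (parameter increment), with the parameter increments constrained to sum to 1; the particular corner dimensions and the helper names are assumptions.

```python
import math

def corner_energy(d_alpha, dl, dl_c, dphi):
    """Energy of two straight segments (length dl each) joined by an arc.

    d_alpha is the parameter increment on each straight segment; the arc
    (length dl_c, turn dphi) receives the remainder, 1 - 2 * d_alpha.
    """
    d_alpha_c = 1.0 - 2.0 * d_alpha
    return 2.0 * dl ** 2 / d_alpha + (dl_c ** 2 + dphi ** 2) / d_alpha_c

def closed_form(dl, dl_c, dphi):
    """Return the predicted (d_alpha, d_alpha_c, energy) at the extremum."""
    r = math.hypot(dl_c, dphi)
    total = 2.0 * dl + r
    return dl / total, r / total, total ** 2

def minimize_1d(f, lo, hi, iters=200):
    """Ternary search; valid here because the energy is convex in d_alpha."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

dl, dl_c, dphi = 0.495, 0.01, math.pi / 2     # illustrative corner, total length 1
a_num = minimize_1d(lambda a: corner_energy(a, dl, dl_c, dphi), 1e-9, 0.5 - 1e-9)
a_ref, a_c_ref, e_ref = closed_form(dl, dl_c, dphi)
print(a_num, a_ref, corner_energy(a_num, dl, dl_c, dphi), e_ref)
```

In this example the numerical minimizer and the closed-form allocation agree, and a sizeable share of the parameter range is assigned to the very short arc.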

In comparison, the simple curvature minimization
principle will give for this corner

$$E_n = \frac{\Delta\phi^2}{\Delta l'}$$

which tends to infinity as the length of the corner $\Delta l'$ goes
to zero. Although we have dealt with an isolated corner,
similar methods can be applied in more general cases, one
of which is treated in the next section.
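
The contrast between the two measures shows up clearly when the corner is shrunk. The short sketch below is an illustration under stated assumptions: a curve of total normalized length 1 with a single right-angle corner of arc length `dl_c`, the curvature measure taken as the squared turn divided by the corner length, and the spring measure taken as the square of (straight length plus the root of corner length squared plus turn squared). The function names are hypothetical.

```python
import math

def curvature_measure(dphi, dl_c):
    """Normalized curvature energy of a corner of length dl_c and turn dphi."""
    return dphi ** 2 / dl_c

def spring_measure(dphi, dl_c):
    """Extremal spring energy; the two straight parts have total length 1 - dl_c."""
    r = math.hypot(dl_c, dphi)
    return ((1.0 - dl_c) + r) ** 2

dphi = math.pi / 2                      # a right-angle corner
for dl_c in (1e-1, 1e-2, 1e-3, 1e-4):
    print(dl_c, curvature_measure(dphi, dl_c), spring_measure(dphi, dl_c))
```

As the corner length decreases, the first column of energies grows without bound while the second settles near the square of (1 plus the turn).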

In conclusion, we have seen the different roles of the
two terms in $E_s$. The first term will be rather dominant
in sections of the curves having rapid curvature changes
over short length, such as bumps, corners, or saw-teeth.
The second term will be the major contributor in the large-
scale features, having slow variations in curvature, such as
the other boundaries of the saw. Thus we have obtained
a “natural” way of treating curves with two very different
scales of curvature change, without having to use arbitrarily
preset filtering widths, as is commonly done.




One may refine the way the arclength is normalized.
The normalization to a total length of 1 is not essential.
One can define $l = \mu s/L$, where $\mu$ is an arbitrary positive
constant. This results in multiplying the second term in
$E_s$, representing the stretching energy, by $\mu^2$. In this way
one can control the relative contributions of the bending
and stretching energies of the spring (instead of $k$). The
circular case, however, will not be affected.

IV.  SKEW-SYMMETRY 

As an application of the ability of our new measure $E_s$
to handle sharp corners, one can consider the deprojection
of a parallelogram. If we regard this curve as a planar
shape which is slanted and tilted in space, the question
arises what is the most likely shape that has given rise to
this apparent parallelogram. According to our minimum
principle, the most likely shape is the one that minimizes
the spring energy $E_s$.

Not surprisingly, the result turns out to be a rectangle.
This agrees with the maximum compactness method, the
maximum likelihood method and the human observation
“method”. The first two, however, have very restricted
domains of applicability.

It is only necessary to prove that a rectangle extremizes
our energy $E_s$ over the set of parallelograms, i.e. that the
derivative of the rectangle’s energy with respect to the
unknowns vanishes. The unknowns here are the distribution
of $\alpha$ along the curve, and the angle by which the
parallelogram deviates from a rectangle. This can be proven using
the previous result [Weiss, 1986a].


V.  GENERAL  SURFACES  IN  3-D 


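So far we have dealt with planar curves, described by
one coordinate. We now turn to the general case of a
(reasonably) arbitrary surface in 3-D. Such a surface can be
parametrized by two coordinates $\alpha_1$, $\alpha_2$. Our energy $E_s$
will now be a double integral:

$$E_s = \iint \left[ \sum_{i=1,2} \left( \frac{\partial t_i}{\partial\alpha_i} \cdot \frac{\partial t_i}{\partial\alpha_i} + k \left(\frac{\partial l_i}{\partial\alpha_i}\right)^2 \right) + V \right] d\alpha_1 \, d\alpha_2$$

where $l_i$ and $t_i$ are the normalized length and the tangent,
respectively, of the curve parametrized by $\alpha_i$.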

Returning to our “spring” model, one can visualize the
surface as made up of a grid of springs, which form its
natural coordinates, with $\alpha_i$ being the number of coils between
a surface point and the boundary, along a coordinate.

Given a set of known contours, either on the boundary
or otherwise, and suitable information about the surface
normal on them, we reconstruct the surface that best fits
these contours as the one that extremizes the above integral
(over all the surfaces that comply with the data). The
$\alpha_i$ will now have a more visual meaning than in the one-
parameter case: they will be a pair of natural coordinates
parametrizing the surface. Although general experiments
have not been conducted, in several important cases, such
as planar geodesics discussed later, these curves seem to be
the ones that our visual intuition would expect as natural,
and so we refer to them as “natural coordinates”. This
is another advantage of the new procedure over the simpler
energy principle. It is interesting that the variables $\alpha_i$
can serve the dual purpose of both handling bumps (and
straight lines) and providing a set of 3-D natural coordinates.

A simple example is that of a circular boundary which is
assumed to be occluding. It is a common assumption that
for a smooth surface, the surface normal along an occluding
contour lies in the projection plane. It is easy to prove
that this boundary will give rise to a (hemi-)sphere,
complete with its longitudinal and latitudinal coordinates,
which will be represented by $\alpha_1$ and $\alpha_2$. Rather than do
that, we shall prove a general theorem about curves on
surfaces, which will be very useful in finding natural
coordinates as well as the surfaces themselves (including the
sphere). Of course, for generic data the minimization can
only be done numerically.

We  first  define  a  curve  of  local  symmetry  T  as  one  which 
divides  the  surface  into  two  parts  that  are  symmetric  in  the 
vicinity  of  the  curve.  Put  another  way,  a  reflection,  say  of 
the surface lying to the right side of T, will match the left side,
near T. An example is the center line of any fold, ridge,
valley or corrugation on a surface (Fig. 2), if this line is
planar.  Another  example  is  the  three  symmetry  lines  on  an 
ellipsoid  (this  is  a  global  symmetry). 

An  alternative  definition  is  that  the  curvature  in  the 
direction  perpendicular  to  the  curve  has  an  extremum  at 
each  point  along  the  curve. 

Theorem  1:  A  curve  of  local  symmetry  is  planar. 

Proof:  Trivial.  A  reflection  of  the  right  side  of  T  will 
not  match  the  left  side  unless  T  is  planar  (Fig  2). 

Theorem 2: A local network of coordinates, consisting
of a local symmetry curve T and curves that are orthogonal
to it, locally extremizes the energy $E_s$.

By “locally extremize” we mean that an infinitesimal
variation of this coordinate patch in the vicinity of T will
leave $E_s$ unchanged (i.e. $\delta E_s = 0$). The proof is in [Weiss,
1986a].

A very interesting class of surfaces to which our theorem
is easily applicable is the class having an “ignorable
coordinate”, namely, its curvatures do not change as we move
along this coordinate. As sub-classes one can mention (a)
strips, or pipes, of arbitrary cross section, whether straight,
circularly or helically bent (Fig. 3), in which the ignorable
coordinate is the length of the pipe, and (b) surfaces of
revolution, in which case the azimuth $\phi$ is “ignorable” (Fig. 2a).
Any coordinate line orthogonal to the ignorable direction
can be regarded as a local symmetry curve, as the surface
stays the same on its sides. Thus, this curve, and the curves
in the ignorable direction that intersect it, form a local
network of curves that satisfy the conditions of Theorem 2,
and thus it locally extremizes $E_s$.

A  further  consequence  of  Theorem  2  is  the  ability  to 
make  predictions  about  surfaces  when  they  are  not  known 




in advance, without having to actually extremize $E_s$ or solve
the EL equations. It is clear that generally, the greater the
number of extremal local coordinate patches a surface has,
the more “extremal” the global 2-dimensional integral $E_s$
will be. (This would be clearer if this “extremum” were a
“minimum”, which it usually is, but as we have not examined
here the second derivatives, we shall stick to the term
“extremum”.) Thus, a surface that extremizes $E_s$ will contain
as many symmetry curves as are compatible with the
initial data. As a consequence, we will tend to obtain surfaces
containing planar, spherical or cylindrical parts, exact
or approximate, rather than bumpy ones. As a particular
example, we can use this general result to conclude, without
having to solve the EL equations, that a circular occluding
boundary will give rise to a sphere. This is because a
sphere is the most symmetric surface (consistent with this
boundary condition), thus having the greatest quantity of
symmetry curves. More generally, boundary contours such
as those of Fig. 2a will give rise to a surface of revolution,
because this surface consists of a collection of symmetric
local networks (the meridians plus the parallels) which fits
those contours. Similar arguments apply for other surfaces.

It  is  reasonable  to  assume  that  when  the  condi¬ 
tions  of  the  theorem  are  satisfied  only  approximately,  e.g. 
when  the  curve  is  only  approximately  symmetric,  a  simi¬ 
lar  parametrization  will  still  take  place,  with  the  natural 
coordinates  approximately  following  the  quasi-symmetry 
curves.  Thus  our  theory  is  applicable  to  shapes  such  as 
generalized  cones,  as  long  as  the  flutings  vary  slowly  on  the 
length  scale  of  the  cone’s  radius.  This  is  also  consistent 
with  human  perception  (Fig.  4). 

As in the 2-D case, one can incorporate into the
reconstruction data from curves and points not lying on the
surface, by adding to $E_s$ terms representing an attraction
energy between the surface and this data.

VI.  DISCUSSION  AND  RELATION 
TO  OTHER  WORK 

In  the  preceding  sections  we  have  suggested  a  new 
shape  reconstruction  method,  with  varying  mesh,  and 
solved  it  analytically  in  a  few  cases.  These  cases  were  solv¬ 
able  mainly  because  of  their  symmetry.  The  notion  of  sym¬ 
metry  plays  a  central  role  in  other  work  as  well.  We  have 
already mentioned the Brady-Yuille compactness measure,
used  for  the  deprojection  of  closed  planar  curves.  As  com¬ 
pactness,  or  the  ratio  of  area  to  the  perimeter  squared,  is 
generally  higher  for  more  symmetric  shapes,  the  compact¬ 
ness  can  be  viewed  as  a  good  measure  of  global  symmetry 
(as  distinct  from  our  local  symmetry). 

Brady  &  Asada  [1984]  have  studied  local  symmetries  in 
planar  shapes.  Strictly  speaking,  our  definition  of  “local” 
applies  to  an  infinitesimal  vicinity  around  a  curve.  In  this 
sense,  every  straight  line  on  a  plane  is  a  local  symmetry 
curve,  except  at  its  edges.  Brady  examined  wider  vicini¬ 
ties,  that  extend  to  the  nearby  boundaries  of  the  curve, 
so  we  shall  elevate  the  type  of  symmetry  he  treated  to  a 
“regional” symmetry. Brady has shown that this kind of
symmetry  is  very  useful  in  representing  and  handling  many 
types  of  shapes.  This  indicates  that  symmetry  is  of  fun¬ 


damental  importance  at  all  spatial  scales  (global,  regional, 
and  local)  of  shape  representation.  We  used  symmetry  in  an 
analytic  treatment  of  shapes  in  which  it  strictly  holds,  but 
its  application  is  probably  much  wider:  symmetric  shapes 
can  be  used  as  a  first  approximation  in  a  numerical  compu¬ 
tation  of  more  general  shapes.  This  will  potentially  increase 
the  computational  efficiency  in  reconstructing  shapes  which 
possess some axial symmetry, such as “generalized cylinders”,
or that can be decomposed into such shapes.

In  differential  geometry  terms,  our  local  symmetry 
curves  are  planar  geodesic  ones.  Brady  et  al.  [1984]  and 
Stevens  [1981]  have  shown  that  planar  geodesics  are  usually 
“natural” coordinates. However, not all natural coordinates
are planar geodesics. For example, the parallels on a sur¬
face  of  revolution,  and  the  spirals  on  a  helicoid  (Fig.  5), 
are  not  planar  geodesics  [DoCarmo,  1976].  The  question 
why  these  look  natural  was  left  open,  as  they  did  not  seem 
to  fit  any  consistent  rule.  In  our  theory,  these  curves  are 
natural  coordinates  by  virtue  of  their  being  orthogonal  at 
each point of their length to planar geodesics, namely the
meridians  and  the  rulings,  respectively.  This  one  rule  fits 
all  the  above  cases.  Moreover,  unlike  the  previous  work,  our 
results  are  part  of  a  general  theory  of  shape  representation 
derived  from  sound  first  principles. 

The  above  mentioned  papers  contain  many  numerical 
techniques  for  finding  symmetries,  geodesics,  etc.  Their 
agreement with our work suggests that many of these techniques
will be very useful in a computational implementation
of our theory.

Although  we  have  only  examined  reconstruction  from 
spatial  data,  such  as  points  and  curves,  other  types  of  in¬ 
formation  can  be  included  in  the  energy  functional  to  be 
minimized.  It  was  shown  [Terzopoulos,  1985]  that  shading, 
stereo  and  other  data  can  be  cast  in  an  energy  form  and 
treated as a variational problem. Constraints supplied by
a knowledge-based system can also be added. For example,
one can add a term to the energy measuring how close
the  shape  is  to  a  given  template.  Investigations  of  these 
possibilities  are  currently  under  way. 

VII. NUMERICAL EXPERIMENTS: INTRODUCTION

In  the  following  sections  we  will  describe  our  numeri¬ 
cal  experiments  with  the  method.  A  subset  of  the  surface 
interpolation  problem  is  the  fitting  of  curves  to  given  data 
points, either in the plane or in 3-D space. This subset
retains  many  of  the  important  properties  of  the  full  3-D 
problem. We have implemented this restricted problem numerically.
Thus, we have been able to investigate many
of the major issues of the numerical implementation of the
full  3-D  problem,  with  the  program  being  much  easier  to 
develop  and  much  faster  to  run.  Some  of  the  results  and 
lessons  learned  from  this  implementation  are  presented  in 
this  report.  A  subsequent  report  will  contain  the  full  3-D 
implementation. 

The 2-D implementation has merits in its own right.
It defines a new method of interpolation in which the mesh
size is automatically and optimally adapted to the local
variation of the curve, with a finer mesh being used in
rapidly changing sections of the curve.

Our  experience  with  the  numerical  implementation  has 
led  us  to  make  a  number  of  refinements  to  this  basic  model, 
that  do  not  change  the  basic  principles.  The  main  ones  are: 

• The use of the normalization factor makes the problem
a global one, as a change in one part of a curve changes
the energy everywhere, through L. This is undesirable, especially
if parallel computation is to be implemented. One
can circumvent the problem by treating L as an independent
variable, with the normalization being imposed as a
constraint, which is added to the energy with a Lagrange
multiplier λ. Thus, a fourth term is added to the energy,
in the form

[equation illegible in source]

The constant part of this term does not need to be added to the
functional E, because the minimization will not change it,
while the integral part joins the other integrals in E. E has
now to be minimized with respect to L, as well as with
respect to the curve, and the above constraint equation
has to be satisfied. These two added equations enable us
to determine L and λ. The problem is still global, but
the global calculations have been isolated to these last two
equations, and they do not need to be done for each knot
point, which is a great relief.

It can be shown that for a small attraction potential,
and a nearly uniform knot distribution, the constraint equation
is approximately satisfied if λ ≈ −2. Thus, by substituting
this value, the amount of global calculation is further
reduced.

• As t = dx/ds, the bending term can be written explicitly
as

[equation illegible in source]

From the point of view of the basic principles, other forms
of dependence of x on s, α are acceptable. For instance, one
can reverse their order. These variables are not independent
and the exchange will change the value of the bending term.
One can prove that

[equation illegible in source]

A linear combination of the two possibilities is also acceptable.
The form in the first line above has proven preferable
numerically. Furthermore, one can get rid of s altogether
in the bending term, and define the curvature purely as
a function of α. By dropping the factor involving ds/dα
above, the curvature becomes simply d²x/dα², with its square
being the bending energy. This will strengthen the effect of
concentration of points in corners, as we shall see.
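
The effect of reversing the order of differentiation can be sketched with the chain rule. The lines below are a reconstruction, not the paper's own equations: they assume that the bending term is built from the derivative of the tangent t = dx/ds with respect to the curve parameter, written \alpha here, and that s is the arclength.

```latex
\begin{align*}
\frac{d}{d\alpha}\!\left(\frac{d\mathbf{x}}{ds}\right)
  &= \frac{d}{d\alpha}\!\left(\frac{d\mathbf{x}}{d\alpha}\,\frac{d\alpha}{ds}\right)
   = \frac{d\alpha}{ds}\,\frac{d^{2}\mathbf{x}}{d\alpha^{2}}
     + \frac{d\mathbf{x}}{d\alpha}\,\frac{d}{d\alpha}\!\left(\frac{d\alpha}{ds}\right), \\
\frac{d}{ds}\!\left(\frac{d\mathbf{x}}{d\alpha}\right)
  &= \frac{d\alpha}{ds}\,\frac{d^{2}\mathbf{x}}{d\alpha^{2}}
   = \left(\frac{ds}{d\alpha}\right)^{-1}\frac{d^{2}\mathbf{x}}{d\alpha^{2}} .
\end{align*}
```

Under these assumptions the two orderings differ by the term containing the derivative of d\alpha/ds, and removing the factor (ds/d\alpha)^{-1} from the second form leaves the plain second derivative with respect to the curve parameter.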


VIII. CHOICE OF CURVE REPRESENTATION

So  far  we  have  dealt  with  idealized  curves  having  an 
infinite  number  of  degrees  of  freedom.  For  numerical  imple¬ 
mentation  we  have  to  choose  a  finite  representation.  Several 
questions  arise,  which  do  not  arise  in  standard  interpolation 
schemes.  Normally,  one  has  knot  points  at  fixed  positions 
along  the  x  axis,  so  that  the  x  values  are  known  and  one 
seeks the y(x) values. In our case the knot points are distributed
along the curve in accordance with the parameter
α, which is itself unknown. Both x and y are functions
of α. Under these conditions, how shall we construct the
mesh and how shall we perform the discretization? What
shall we choose as independent variables and which are the
dependent ones? Additionally, what is the best way to calculate
the derivative with respect to α?

The knot points have been handled as follows: Our unknowns
are the x, y coordinates of n knot points, namely
x_1, ..., x_n (2n variables). More accurately, as we shall see,
we use an equivalent linear combination of these values
as unknowns. These unknown points are assumed to be
equally spaced with respect to the α parameter, with the
interval length Δα given, Δα = [value illegible in source]. We thus obtain
E as a function of the unknowns x_i. This has turned the
problem into a minimization of E(x_i) with respect to x_i.

Given the knot values x_i, y_i, the functions x(α), y(α) and their
derivatives with respect to α can be approximated in several
ways. The particular way used can have a profound
effect  on  the  stability  and  convergence  of  the  algorithm. 
While  we  have  not  done  a  thorough  comparative  study  of 
the  merits  of  the  various  methods,  we  have  considered  the 
main  approaches.  We  used  the  general  guidance  of  the  lit¬ 
erature  to  narrow  down  our  choice,  then  tried  two  methods 
for  a  while  until  we  settled  on  one. 

First,  we  have  the  choice  between  the  finite  difference 
method  and  the  finite  element  method.  The  former  is  based 
on  a  straightforward  substitution  of  each  derivative  by  its 
finite difference analog, e.g. dx/dα is replaced by Δx/Δα.
The method has the advantage of a simple implementation.
However, it has a greater potential for instabilities than
other  methods,  and  it  does  not  give  the  value  of  the  curve 
between  the  knot  points,  which  makes  it  inconvenient  for 
an  interpolation  scheme.  So  we  opted  for  the  finite  element 
method.  This  method  consists  of  approximating  the  func¬ 
tion  as  a  superposition  of  (usually  local)  basis  functions. 
These  basis  functions,  which  are  typically  piecewise  poly¬ 
nomials,  can  easily  be  differentiated  so  that  the  derivative  of 
the function at each curve point can be obtained. (We have
two separate functions here: x(α), y(α).) Through the minimization,
one can find the coefficients of the superposition.
These are related through a known linear transformation to
the values of the functions at the knot points, x_i, which are
usually of more interest.

Now one has to choose appropriate basis functions.
The simplest possibility is using triangular (“roof”) basis
functions. This is equivalent to representing the curve by
a chain of line segments. One can use the slopes of the
lines ((x_i − x_{i−1})/Δα, (y_i − y_{i−1})/Δα) as the first derivatives,
while the angle between adjacent segments is related
to the second derivative, or curvature. Thus we have all the
derivatives we need as functions of the unknowns x_i, and
the implementation is quite straightforward in this respect.
The decomposition into basis functions does not need to be
done explicitly in this case. The superposition coefficients
are simply the x_i, the values of the function at the knot
points. (The first derivatives, but not the second ones, are
the same here as in the finite difference method.)

This  method  was  tried  for  a  while,  and  we  encountered 
some  problems.  In  certain  situations,  particularly  around 
corners,  which  is  the  case  in  which  we  are  most  interested, 
the  knot  points  are  quite  close  to  each  other,  as  expected. 
The  line  segments  connecting  them  are  quite  short.  For  a 
very  short  segment,  any  small  movement  in  the  positions 
of  its  ends  causes  a  considerable  change  in  its  inclination. 
In other words, when the length of the line segment tends
to 0, the derivative ∂E/∂x_i tends to infinity. This is easily
seen mathematically, as this derivative contains a factor of
1/Δs_i (Δs_i being the length of the line segment). This large
derivative can cause a problem for the minimization technique,
which needs these derivatives. It also means that the
curve can become unstable around corners, with its exact
shape affected by small errors in x_i. This is especially noticeable
with the limited accuracy of 32-bit single precision
arithmetic.
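
The sensitivity of a short segment can be illustrated numerically. The sketch below is not the original program; the function names and the 0.001 displacement are illustrative assumptions. It moves one end of a horizontal segment sideways by a fixed amount and reports how much the inclination changes as the segment gets shorter.

```python
import math

def segment_angle(p, q):
    """Inclination (radians) of the segment from point p to point q."""
    return math.atan2(q[1] - p[1], q[0] - p[0])

def angle_change(length, nudge):
    """Change in inclination when the far end of a horizontal segment
    of the given length is displaced sideways by `nudge`."""
    start = (0.0, 0.0)
    before = segment_angle(start, (length, 0.0))
    after = segment_angle(start, (length, nudge))
    return after - before

if __name__ == "__main__":
    # The same small displacement turns a short segment far more.
    for length in (1.0, 0.1, 0.01):
        print(length, angle_change(length, 1e-3))
```

For displacements much smaller than the segment, the change in angle is roughly the displacement divided by the segment length, which is the 1/Δs_i behavior described above.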

The  situation  is  not  hopeless.  It  has  been  alleviated 
by working in double precision, and by limiting the segments’
length. Also, the derivatives ∂E/∂x_i have been
computed analytically rather than numerically. Nevertheless,
we found it more convenient to work with a differ¬
ent  scheme,  which  is  inherently  more  stable  in  this  respect, 
namely  choosing  cubic  splines  as  our  basis  functions. 

Here again one has a choice. One possibility is to
use the cubic Hermite polynomials [Strang and Fix, 1973].
These basis functions are defined on a very small local support,
namely one that consists of the two mesh intervals
[α_{i−1}, α_{i+1}] around a mesh point α_i, and they vanish
outside of this domain. There are two independent basis
functions attached to each knot point, and the coefficients
multiplying them are linearly related to the value of the
function and its first derivative at that knot point. The
basis function thus depends on the local properties of the
curve only. In all, we have 2n unknowns for x(α), and similarly
for y(α). Representing the curve by Hermite cubic
polynomials would be relatively simple because of the local
nature of the superposition, but we would have to minimize
E with respect to both x_i and the derivatives dx_i/dα.

One  way  to  deal  with  the  unknown  derivatives  is  to  ap¬ 
proximate  them  by  the  known  values  of  the  function  in  the 
vicinity,  using  a  finite  difference-like  method.  The  Bessel 
and the Akima cubic splines [DeBoor, 1978] are obtained
this  way.  We  have  not  tried  these  methods  but  we  feel  that 
they  do  deserve  a  fair  chance. 

The superposition of Hermite polynomials is continuous
and has a continuous first derivative. (This is because
the basis functions and their derivatives vanish at the ends
of the intervals they are supported on.) One can reduce
the number of unknowns by requiring higher order continuity,
e.g. a continuous second derivative. A price is paid
by reducing the locality of the representation. In this technique
the curve is represented as a superposition of cubic
B-splines. These basis functions are piecewise cubic polynomials
supported on a larger interval [α_{i−2}, α_{i+2}] around the
knot point. There is only one possible basis function at each
knot point, reducing the number of coefficients, and thus of
unknowns, to n for each component (x, y). The relation between
the superposition coefficients and the curve function
values is no longer local. At each knot point, a contribution
to the superposition arises from the basis functions
attached to the two neighboring knots. This makes the calculation
at each point more complex and time consuming.
In our evaluation, though, this is more than offset by the
reduction in the number of unknowns. Thus we eventually
decided to use cubic B-splines as the basis functions for our
curve components. These functions have vanishing values
and vanishing first and second derivatives at the interval edges, and thus
the superposition is twice differentiable. Another incentive
is the inherent relation between these splines and minimal
curvature. It can be shown that among all twice differentiable
functions which pass through known data points,
a piecewise cubic polynomial has the lowest overall curvature.
This statement refers to the functions x(α), y(α). The
implication for the function y(x) is not immediately clear.

The  next  section  contains  a  more  precise  mathematical 
description  of  our  spline  representation. 

IX. THE SPLINE REPRESENTATION

A major distinction between our representation and
most other spline fittings of a curve is that in our case x
and y are both piecewise polynomials in some variable α,
rather than y being a polynomial in x. This has advantages
in representing non-polynomial curves, such as circles, for
which splines are a notoriously poor approximation. In our
representation the circle can be described by the coordinates
y = cos α, x = sin α. The sin and cos functions
are very well approximated by polynomials (their Taylor
expansions, for instance), and thus the splines are a good
approximation. This is in contrast to the circle being represented
as y = √(1 − x²), which is hard to approximate by
polynomials. The same goes for other conic sections. We
stress that α is not the length between data points as in
some applications, but an unknown function of x, y to be
found by a minimization process.

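Denoting the basis function at knot point i by B_i(α),
and its (two-valued) coefficient by v_i, the curve can be written
as

\[ \mathbf{x}(\alpha) = \sum_{i=0}^{n+1} \mathbf{v}_i \, B_i(\alpha) \]

Based on the form of the B_i, the functions x can be calculated
at each interval [x_i, x_{i+1}] as follows [DeBoor, 1978,
Barnhill and Riesenfeld, 1974]. Define on each interval a
parameter t = (α − α_i)/(α_{i+1} − α_i), running from 0 to
1 proportionally to α. We now have curve segments x_i(t)
with

\[ \mathbf{x}_i(0) = \mathbf{x}_i, \qquad \mathbf{x}_i(1) = \mathbf{x}_{i+1} \]

On each interval the curve is (for knot points equidistant
in α)

\[ \mathbf{x}_i(t) = \frac{1}{6}\,(t^3,\ t^2,\ t,\ 1)\,(C)\,(\mathbf{v}_{i-1},\ \mathbf{v}_i,\ \mathbf{v}_{i+1},\ \mathbf{v}_{i+2})^T \]

where (C) is the matrix

\[ (C) = \begin{pmatrix} -1 & 3 & -3 & 1 \\ 3 & -6 & 3 & 0 \\ -3 & 0 & 3 & 0 \\ 1 & 4 & 1 & 0 \end{pmatrix} \]

We shall also need the first and second derivatives,
which are easily obtained from the above equations, giving

\[ \frac{d\mathbf{x}_i(\alpha)}{d\alpha} = \frac{1}{6\,\Delta\alpha}\,(3t^2,\ 2t,\ 1,\ 0)\,(C)\,(\mathbf{v}_{i-1},\ \mathbf{v}_i,\ \mathbf{v}_{i+1},\ \mathbf{v}_{i+2})^T \]

\[ \frac{d^2\mathbf{x}_i(\alpha)}{d\alpha^2} = \frac{1}{6\,\Delta\alpha^2}\,(6t,\ 2,\ 0,\ 0)\,(C)\,(\mathbf{v}_{i-1},\ \mathbf{v}_i,\ \mathbf{v}_{i+1},\ \mathbf{v}_{i+2})^T \]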
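
The segment formulas can be sketched in a few lines of code. This is a minimal illustration using the standard uniform cubic B-spline matrix, not the original program; the function names and the argument `d_alpha` (the spacing of the knots in the curve parameter) are assumptions made for the example. Each function handles one coordinate and would be called once for x and once for y.

```python
# Uniform cubic B-spline matrix; the overall factor 1/6 is applied in _blend.
C = [
    [-1.0, 3.0, -3.0, 1.0],
    [3.0, -6.0, 3.0, 0.0],
    [-3.0, 0.0, 3.0, 0.0],
    [1.0, 4.0, 1.0, 0.0],
]

def _blend(powers, v4):
    """Return (powers . C . v4) / 6 for one coordinate.

    powers: row vector of four monomial terms in t
    v4:     the four coefficients (v[i-1], v[i], v[i+1], v[i+2])
    """
    weights = [sum(powers[r] * C[r][c] for r in range(4)) for c in range(4)]
    return sum(w * v for w, v in zip(weights, v4)) / 6.0

def segment_value(v4, t):
    """Curve coordinate on one segment, with t running from 0 to 1."""
    return _blend([t ** 3, t ** 2, t, 1.0], v4)

def segment_d1(v4, t, d_alpha):
    """First derivative with respect to the curve parameter."""
    return _blend([3.0 * t ** 2, 2.0 * t, 1.0, 0.0], v4) / d_alpha

def segment_d2(v4, t, d_alpha):
    """Second derivative with respect to the curve parameter."""
    return _blend([6.0 * t, 2.0, 0.0, 0.0], v4) / d_alpha ** 2

if __name__ == "__main__":
    v4 = [0.0, 1.0, 2.0, 3.0]  # coefficients lying on a straight line
    print(segment_value(v4, 0.0), segment_value(v4, 1.0))
    print(segment_d1(v4, 0.5, 0.1), segment_d2(v4, 0.5, 0.1))
```

At t = 0 the value reduces to (v[i-1] + 4 v[i] + v[i+1]) / 6, which is the linear relation between the coefficients and the knot points.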


It remains to deal with the boundary conditions. As
the value of the function at each knot point depends on the
contributions from the B-splines attached to the neighboring
points, a problem arises at the edge points. Note that
in the superposition above the limits are from 0 to n + 1,
with the knot points indexed by i = 1, ..., n only. One
needs an extra assumption for determining v_0, v_{n+1}. This
can be regarded as extending the guiding polygon in some
reasonable way. A common assumption is that the curvature
at the edges vanishes. This can be approximated by
requiring a vanishing second derivative. From the expressions
above for the derivatives it is easy to prove that under
this assumption

\[ \mathbf{v}_0 = 2\mathbf{v}_1 - \mathbf{v}_2, \qquad \mathbf{v}_{n+1} = 2\mathbf{v}_n - \mathbf{v}_{n-1} \]

We found it more suitable to assume a constant curvature
at the ends, rather than zero. This is especially applicable
to circles, or circular arcs, which are a part of many curves,
at least in small neighborhoods. This assumption means a
zero third derivative, which leads to

\[ \mathbf{v}_0 = 3\mathbf{v}_1 - 3\mathbf{v}_2 + \mathbf{v}_3, \qquad \mathbf{v}_{n+1} = \mathbf{v}_{n-2} - 3\mathbf{v}_{n-1} + 3\mathbf{v}_n \]
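
The zero-third-derivative end condition amounts to a simple extrapolation of the coefficient sequence. The sketch below shows it for one coordinate; the function name is hypothetical and the list is assumed to hold the n coefficients of the knot points in order.

```python
def extend_coefficients(v):
    """Return the coefficient list with one extra entry at each end.

    v holds the n coefficients attached to the knot points (one coordinate).
    The added end values make the third derivative of the cubic B-spline
    vanish on the first and last segments.
    """
    if len(v) < 3:
        raise ValueError("at least three coefficients are needed")
    first = 3.0 * v[0] - 3.0 * v[1] + v[2]
    last = v[-3] - 3.0 * v[-2] + 3.0 * v[-1]
    return [first] + list(v) + [last]

if __name__ == "__main__":
    # Coefficients sampled from a quadratic are continued by the same quadratic.
    print(extend_coefficients([1.0, 4.0, 9.0, 16.0, 25.0]))
```

Because the third difference of a quadratic sequence is zero, coefficients that follow a quadratic are extended along the same quadratic, which is one way to check the formulas.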


Another quantity needed in E is

[equation illegible in source]

One can see that knowing the coefficients v_i is equivalent
to knowing the knot points x_i. The formula above
makes the linear transformation from v_i to x_i immediate.
The reverse calculation, from x_i to v_i, is less immediate as
one needs to solve a (simple tridiagonal) system of n linear
equations. Because of this equivalence we can use the coefficients
v_i as unknowns rather than the x_i, which eliminates
the need for solving that system.
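
For readers who do need the reverse calculation, the tridiagonal system can be solved by forward elimination and back substitution. The sketch below is an illustration under stated assumptions, not the paper's code: it uses the relation x[i] = (v[i-1] + 4 v[i] + v[i+1]) / 6 that follows from the segment formula at t = 0, and it assumes the simpler zero-second-derivative end condition, under which the first and last coefficients equal the first and last knot values. The function name is hypothetical.

```python
def knots_to_coefficients(x):
    """Solve for B-spline coefficients v from knot values x (one coordinate).

    Assumes a vanishing second derivative at both ends, so v[0] = x[0] and
    v[-1] = x[-1]; the interior rows read v[i-1] + 4 v[i] + v[i+1] = 6 x[i].
    """
    n = len(x)
    if n < 3:
        return [float(value) for value in x]
    m = n - 2                        # number of interior unknowns
    diag = [4.0] * m                 # main diagonal; off-diagonals are all 1
    rhs = [6.0 * x[i] for i in range(1, n - 1)]
    rhs[0] -= x[0]                   # move the known end coefficients
    rhs[-1] -= x[-1]                 # to the right-hand side
    for k in range(1, m):            # forward elimination
        w = 1.0 / diag[k - 1]
        diag[k] -= w
        rhs[k] -= w * rhs[k - 1]
    inner = [0.0] * m
    inner[-1] = rhs[-1] / diag[-1]
    for k in range(m - 2, -1, -1):   # back substitution
        inner[k] = (rhs[k] - inner[k + 1]) / diag[k]
    return [float(x[0])] + inner + [float(x[-1])]

if __name__ == "__main__":
    print(knots_to_coefficients([0.0, 1.0, 0.0, 2.0, 1.0]))
```

The cost is linear in the number of knots, which is why the text calls the system simple.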

The coefficients v_i have an important geometric interpretation.
They can be regarded as the coordinates of
“guiding polygon” vertices. The curve generally follows this
polygon, but is of course smoother and is guaranteed to
have lower variation. So, the vertices’ coordinates v_i are
not very far, in general, from the knot points x_i. In applications
in which it is not important for the interpolating
function to pass through the data points, one can use the
data points as vertices of the guiding polygon, i.e. as the
spline coefficients v_i. Then, x(α) and its derivatives are
immediately computable by the above formula (assuming α
is approximated, say, by s). This can be useful in obtaining
a first approximation for the curve in the minimization process.
In some applications, it is the knot points x_i which are
given as data, and the curve is forced to pass through them.
Then one must solve the above system of linear equations
to find the coefficients v_i. In our case this is unnecessary as
we treat the data points entirely differently (next section).


X.  REPRESENTING  THE  DATA 

Rather than forcing the curve to pass through the data
points,  we  assign  to  each  such  point  an  attraction  force 
which  it  exerts  on  the  curve.  This  force  should  possess 
some  desirable  characteristics.  It  should  be  strong  in  the 
vicinity  of  the  point  and  fall  off  at  a  large  distance.  The 
usual  least  squares  measure  has  the  opposite  property.  It 
emphasizes  the  effects  of  far  away  points,  which  have  lit¬ 
tle  to  do  with  the  data  to  be  interpolated.  The  vicinity  in 
which  the  force  is  most  effective  should  be  controllable,  as 
well  as  the  strength.  This  will  reflect  the  degree  of  reliabil¬ 
ity  in  determining  the  position  of  the  data  point.  A  reliable 
data  point  can  be  assigned  a  strong  force  in  a  smaller  vicin¬ 
ity.  The  force  is  proportional  to  the  length  of  the  curve  on 
which  it  acts,  ds,  for  it  acts  independently  on  each  (infinites¬ 
imal)  part  of  the  curve.  We  found  it  necessary,  in  addition, 
to  divide  the  force  by  the  total  length  of  the  curve  L,  to 
eliminate  the  incentive  for  the  curve  to  increase  its  length 
without  getting  closer  to  the  data  points.  Without  this  nor¬ 
malization  the  curve  may,  for  instance,  oscillate  around  the 
data  points. 

The  attraction  force  is  thus  represented  in  the  total 
energy  E  by  the  potential  energy  V(x)ds/L.  The  potential 
function  V(x)  is  the  cumulative  influence  of  all  the  data 
points  at  the  point  x,  and  is  independent  of  the  curve.  The 
potential  energy  of  the  curve  thus  depends  on  the  magni¬ 
tude  of  the  potential  function  V  at  the  points  through  which 
the  curve  passes.  The  force  itself  is  the  derivative  of  V.  The 
contribution of each data point to V(x) can be chosen in
several ways, consistent with the requirements above. Denoting
this contribution by g(r), where r = |x − x_k| is the
distance between the point x and a data point x_k, one may
choose

[equation illegible in source]

where a, b, e are parameters at our disposal. This potential
falls off at large distances, with its effective range being determined
by b; a is the strength of the contribution, and e
helps in shaping the force near the data point. We found
a Gaussian function divided by r (i.e. e ≈ 1) more effective
in some cases than a pure Gaussian (e = 0) although
the difference is not crucial. Other functions are possible
but have not been tried, such as a Lorentzian resonance
function

\[ g(r) \propto \frac{1}{r^2 + b^2} \]

This  too  satisfies  our  requirement,  without  the  need  to  cal¬ 
culate  exponential  functions,  which  may  be  time  consuming 
on  some  machines. 

The potential at a point x, V(x), can be regarded as a
discrete convolution over the picture f:

\[ V(\mathbf{x}) = \sum_n f(\mathbf{x}_n)\, g(\mathbf{x} - \mathbf{x}_n) \]

where the summation is over all pixels of the picture, and
f(x_n) is the picture intensity at the pixel at x_n. In the case
of a simple set of data points to be interpolated, f equals 1
at a data point and 0 elsewhere. In that case V(x) is simply
the sum over the data points of g(x − x_k).
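
The sum over data points can be sketched directly. The exact form of the contribution function is not recoverable here, so the two candidates below are assumptions chosen only to match the description: a pure Gaussian and a Lorentzian, each with a strength `a` and a range `b`, and each given a negative sign so that the potential is lowest at the data points. The function names are hypothetical.

```python
import math

def gaussian_g(r, a=1.0, b=1.0):
    """Assumed pure Gaussian contribution: strength a, effective range b."""
    return -a * math.exp(-(r / b) ** 2)

def lorentzian_g(r, a=1.0, b=1.0):
    """Assumed Lorentzian contribution; it needs no exponential."""
    return -a / (r * r + b * b)

def potential(x, data_points, g=gaussian_g, a=1.0, b=1.0):
    """V(x): sum of g(|x - x_k|) over all data points x_k (2-D tuples)."""
    total = 0.0
    for xk in data_points:
        r = math.hypot(x[0] - xk[0], x[1] - xk[1])
        total += g(r, a, b)
    return total

if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    print(potential((0.5, 0.0), data))
    print(potential((0.5, 0.0), data, g=lorentzian_g, b=0.5))
```

Both choices fall off with distance, so a far away data point contributes very little to the potential at x.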

Now  the  question  arises  how  best  to  calculate  V.  V  has 
to  be  known  at  every  point  where  the  curve  is  calculated, 
rather than at every picture point. Thus the convolution
need  only  be  done  at  points  lying  on  the  curve,  and  this  can 
save  a  considerable  amount  of  time,  even  if  several  curves 
are  computed  as  iterations  in  the  minimization  process. 

In  fact  even  with  this  saving  the  convolution  has  proven 
rather  time  consuming  on  a  VAX,  and  we  used  a  different 
approach.  We  calculated  the  convolution  on  a  fixed  rectan¬ 
gular  grid,  then  interpolated  V  at  the  curve  points  using  the 
V  values  on  these  fixed  grid  points.  This  amounts  to  fewer 
points  on  which  the  convolution  is  calculated,  because  the 
number of iterations is usually quite large. Typically we
used  a  20  x  20  grid,  while  the  curve  had  20  points  and  was 
iterated  70  times.  The  interpolation  was  carried  out  using 
the  cubic  Hermite  splines  mentioned  earlier,  in  a  2-variable 
generalization. 
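
The tabulate-then-interpolate idea can be sketched as follows. Note the substitution: this example uses plain bilinear interpolation to stay short, whereas the text describes a two-variable cubic Hermite interpolation. The function names, the grid origin `(x0, y0)` and the spacing `h` are assumptions made for the illustration.

```python
def tabulate(V, x0, y0, h, nx, ny):
    """Evaluate the potential V once at every node of an nx-by-ny grid."""
    return [[V((x0 + i * h, y0 + j * h)) for j in range(ny)] for i in range(nx)]

def interpolate(table, x0, y0, h, point):
    """Bilinear interpolation of the tabulated values at an arbitrary point."""
    nx, ny = len(table), len(table[0])
    u = (point[0] - x0) / h
    w = (point[1] - y0) / h
    # Index of the lower-left node of the enclosing cell, clamped to the grid.
    i = min(max(int(u), 0), nx - 2)
    j = min(max(int(w), 0), ny - 2)
    fu, fw = u - i, w - j
    return ((1.0 - fu) * (1.0 - fw) * table[i][j]
            + fu * (1.0 - fw) * table[i + 1][j]
            + (1.0 - fu) * fw * table[i][j + 1]
            + fu * fw * table[i + 1][j + 1])

if __name__ == "__main__":
    table = tabulate(lambda p: 2.0 * p[0] + 3.0 * p[1] + 1.0, 0.0, 0.0, 0.5, 5, 5)
    print(interpolate(table, 0.0, 0.0, 0.5, (0.37, 1.42)))
```

With the figures quoted in the text, a 20 x 20 grid needs 400 evaluations of the convolution, while 20 curve points over 70 iterations would need 1400.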

Now that we can express both the curve and V at every
point, we can calculate the integral E. The integration has
to be done numerically on each interval [x_i, x_{i+1}]. Since
the  integrand  is  made  up  of  polynomials  with  a  limited 
number  of  parameters,  only  a  few  internal  points  are  needed 
in  each  interval.  In  fact  we  found  that  using  the  mid-point, 
with  Simpson’s  rule,  is  a  satisfactory  approximation.  (In 
spite  of  the  polynomial  representation,  the  integral  cannot 
be  performed  analytically  due  to  the  presence  of  ds  which 
involves  a  square  root,  and  even  otherwise  it  would  be  quite 
tedious.) 
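
Simpson's rule with the mid-point of each interval can be sketched as follows. The example integrates only the arclength element, which is the part containing the square root; the function names and the choice of the curve parameter running from 0 to 1 are assumptions for the illustration.

```python
import math

def simpson_interval(f, t0, t1):
    """Simpson's rule on one interval, using its two ends and its mid-point."""
    tm = 0.5 * (t0 + t1)
    return (t1 - t0) / 6.0 * (f(t0) + 4.0 * f(tm) + f(t1))

def curve_length(dx, dy, n_intervals):
    """Length of a curve whose parameter runs from 0 to 1.

    dx, dy: derivatives of the two coordinates with respect to the parameter
    n_intervals: number of equal knot intervals
    """
    def speed(p):
        return math.hypot(dx(p), dy(p))
    h = 1.0 / n_intervals
    return sum(simpson_interval(speed, k * h, (k + 1) * h)
               for k in range(n_intervals))

if __name__ == "__main__":
    # Parabola x = p, y = p**2 on 0 <= p <= 1.
    approx = curve_length(lambda p: 1.0, lambda p: 2.0 * p, 20)
    exact = math.sqrt(5.0) / 2.0 + math.asinh(2.0) / 4.0
    print(approx, exact)
```

For a smooth integrand the error of this rule falls with the fourth power of the interval length, which is consistent with a single internal point per interval being adequate.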


XI. THE MINIMIZATION METHOD

Once we have expressed the energy E as a function of a
finite number of unknowns, x_i, y_i, it remains to minimize it
with respect to these unknowns. More decisions are needed
here. Basically there are two avenues open before us. The
first is to find the minimum point by solving the system of
non-linear equations

\[ \frac{\partial E(x_i, y_i)}{\partial x_j} = 0, \qquad \frac{\partial E(x_i, y_i)}{\partial y_j} = 0 \]

In  the  theoretical  paper  [Weiss,  1986a]  we  used  the  Euler- 
Lagrange  equations  to  minimize  E  analytically  in  some 
simple  cases.  The  above  minimization  method,  where  the 
quantities  in  E  are  expressed  as  a  superposition  of  finite  ele¬ 
ments,  amounts  to  discretizing  the  Euler-Lagrange  equation 
by  the  Rayleigh-Ritz  method. 

The  best  routine  we  know  of  for  the  solution  of  a  non¬ 
linear  system  of  equations  is  in  the  MINPACK  library,  avail¬ 
able  from  Argonne  National  Laboratory.  It  also  appears, 
under  the  name  SNSQE,  in  the  National  Bureau  of  Stan¬ 
dards  library  and  the  SLATEC  library.  It  is  based  on  a 
modified  Newton  method,  combined  with  Powell’s  method. 

This  method  was  on  the  whole  not  successful  in  our 
application.  The  Newton  method  needs  a  pretty  good  ini¬ 
tial  guess  in  order  for  it  to  converge.  Except  in  special 
cases  where  we  could  provide  one,  the  method  rarely  con¬ 
verged,  and  when  it  did  it  was  not  to  a  minimum  but  to 
some  stationary  point.  Thus,  it  wasn’t  hard  to  decide  to 
take  the  other  route,  and  treat  the  problem  as  a  non-linear 
optimization  problem. 

There  are  still  several  choices.  We  focused  on  two,  the 
conjugate  gradient  method,  and  the  variable  metric  BFGS 
method,  both  also  available  from  Argonne  National  Labo¬ 
ratory.  Both  methods  work  well  and  always  converge  to  a 
solution.  The  conjugate  gradient  method  is  guaranteed  to 
converge  to  the  minimum  in  n  steps,  where  n  is  the  num¬ 
ber  of  unknowns,  for  a  quadratic  function  E.  Our  problem 
is  in  general  not  quadratic,  and  3-4  times  that  number  is 
needed. (The program uses a Beale restart if more than
n steps are executed.) While the BFGS algorithm is usually
faster, it requires considerably more memory than the
conjugate gradient method, on the order of n² numbers vs.
n. While the differences may not be very important for a
curve, they may be of crucial significance for surface interpolation,
where n can be of the order of 100 × 100. For
a curve with 20 knot points, the BFGS was about 30-40%
faster, depending on the particular problem, than the conjugate
gradient method, but required 1000 memory cells vs.
200 (≈ n(n + 7)/2 vs. ≈ 5n).
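
The n-step property for a quadratic function can be checked with a small example. The sketch below is the textbook linear conjugate gradient iteration for minimizing (1/2) x·A x − b·x with a symmetric positive definite matrix A. It is not the library routine used in the experiments and has no restart logic; the function names and the 3 × 3 test matrix are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, steps):
    """Run `steps` conjugate gradient iterations starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x (minus the gradient)
    p = list(r)                      # first search direction
    rs = dot(r, r)
    for _ in range(steps):
        if rs == 0.0:                # already at the minimum
            break
        Ap = matvec(A, p)
        step = rs / dot(p, Ap)       # exact line minimum along p
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def max_residual(A, b, x):
    return max(abs(ai - bi) for ai, bi in zip(matvec(A, x), b))

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    for steps in (1, 2, 3):
        print(steps, max_residual(A, b, conjugate_gradient(A, b, steps)))
```

Only a handful of vectors of length n are kept, which is the source of the memory advantage over BFGS. Taking n = 40 unknowns (two coordinates for each of 20 knot points, an assumption about how n is counted), n(n + 7)/2 = 940 and 5n = 200, in line with the figures quoted above.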

At  this  point  there  arises  the  question  of  the  unique¬ 
ness  of  the  solution.  For  a  convex  function  E,  uniqueness 
is assured. If a function is convex with respect to its arguments,
then it has only one minimum. (This has nothing to
do with the convexity of the curve itself.) Our E is in general
non-convex. One reason is that the potential function
V is not convex. Thus one can arrive at a local minimum,
and be ignorant of a lower minimum nearby. The choice
of the parameters in the attraction force g has an impact
on the convexity of V. Quadratic g's would always yield a
quadratic sum V, and thus V would have a unique minimum.
We rejected this kind of force because of its bias towards
distant points. A finite range force that falls off beyond an
effective distance (b in our notation), such as a Gaussian,
is approximately quadratic in an area the size of b, and thus there is a
unique minimum at areas of roughly this size. For a long
range force, or large b, there will be only a few minima in
the picture. However, the resolution of the data is lower,
as g smoothes the picture by being convolved with the picture.
In addition, the shallowness of the resulting minima
means a slow convergence rate. A short range force, with
small b, leads to a curve that follows the data more accurately,
but with a more bumpy function to minimize. Thus
the parameter b can control the overall smoothness of the
curve to some extent, along with the total number of knot
points.
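
The influence of the range parameter on the number of minima can be seen in a one-dimensional toy case. The sketch assumes a pure Gaussian contribution with a negative sign and two data points at −1 and +1; the function names and the sampling grid are illustrative choices, not taken from the experiments.

```python
import math

def V(x, data, b):
    """Assumed potential: sum of negative Gaussians of range b."""
    return -sum(math.exp(-((x - xk) / b) ** 2) for xk in data)

def count_local_minima(data, b, lo=-4.0, hi=4.0, samples=2001):
    """Count interior grid points lying strictly below both neighbors."""
    xs = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    vs = [V(x, data, b) for x in xs]
    return sum(1 for k in range(1, samples - 1)
               if vs[k] < vs[k - 1] and vs[k] < vs[k + 1])

if __name__ == "__main__":
    data = [-1.0, 1.0]
    for b in (0.5, 3.0):
        print(b, count_local_minima(data, b))
```

With the short range the potential keeps a separate well at each data point, while with the long range the two wells merge into a single shallow one between the points, which is the loss of resolution described above.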

Another  consequence  of  the  non-linearity  is  the  loss  of 
the  guaranteed  stability  that  the  linear  finite  element  the¬ 
ory  gives.  Instability  occurs  when  the  discretized  problem 
has  a  solution  of  its  own  which  is  very  far  from  the  continu¬ 
ous  one  (such  as  a  solution  that  oscillates  sharply  between 
the  mesh  points).  This  is  a  quite  common  problem  in  fi¬ 
nite  difference  methods.  The  finite  element  method  that  we 
are using is well understood in the linear case and ensures
stability and convergence: the solution of the discrete equation
will approach that of the continuous one as n increases.
The non-linear behavior is not that well understood. Our
implementation had a minor problem which we currently
regard as instability. The curve sometimes “folds” on itself
and turns sharply by 180°. The turning region has of
course a high curvature, but it is highly localized and the
discrete mesh manages to avoid it. A detailed inspection
of the calculation led us to change the curvature term by
either interchanging s and α or, better still, eliminating s
altogether and expressing the curvature as d²x/dα² (see Sec.
II). This has indeed almost eliminated the folding problem,
although more clarification of the situation is needed.

XII. RESULTS

As  a  test  of  the  basic  principles  of  our  theory  we  com¬ 
pared  the  performance  of  our  method  against  that  of  a  sim¬ 
pler  energy  principle  [Horn,  1977]  in  representative  cases 
and  tested  the  effect  of  our  innovations.  We  also  tested 
the  effects  of  several  key  ingredients  of  the  theory.  For  this 
purpose  a  synthetic  data  set  was  sufficient. 

We  first  considered  an  interpolation  problem,  in  which 
information  is  given  only  at  the  end-points  of  the  curve, 
while the rest of the curve has to be interpolated. Here,
V = 0. Fig. 6 shows the curve obtained in such a case. This
curve might represent an occluded segment of a shape whose
visible part is represented by the dashed lines. When both
tangents were at 90°, we obtained a semi-circle. Without
the normalization factor, an ellipse was obtained. The end-points
and the curve tangents there were maintained by a
strong  quadratic  potential  applied  at  the  end-points  and 
the  points  adjacent  to  them. 

We  next  examined  the  effect  of  our  reparametrization 
of  the  curve  by  a.  This  was  done  by  using  a  right-angle  cor¬ 
ner  as  our  data  and  testing  how  well  the  curve  can  follow  it. 
The  spring  was  attracted  to  the  data  points  by  a  force  rep¬ 
resented  by  a  potential  V  described  earlier.  In  the  following 
figures,  crosses  represent  the  data  points,  and  diamonds  are 
the  knot  points.  Fig  7a  shows  the  curve  parametrized  sim¬ 
ply  by  the  arclength  s.  The  positions  of  the  knot  points 
were determined by the requirement of minimal bending
potential energy. (No stretching.) In Fig 7c the curvature,
or bending energy, was measured with respect to the
parameter  a,  which  was  an  unknown  found  by  the  mini¬ 
mization  algorithm.  Fig  7b  is  an  intermediate  form  of  the 
curvature, using both s and a. (The curvatures in Figs.
7a,b,c were expressed, respectively, as $d^2x/ds^2$,
$d^2x/(ds\,da)$, and $d^2x/da^2$.)

One  can  clearly  see  that  the  corner  is  followed  much  more 
closely  when  a  is  used  in  place  of  s,  as  more  points  are  con¬ 
centrated  in  the  corner  in  this  case.  The  distance  between 
the  knot  points  in  the  corner  in  Fig  7c  is  about  a  quar¬ 
ter  of  their  distance  along  the  straight  sections.  In  Fig. 
7a,  in  contrast,  where  conventional  curvature  is  used,  the 
distance  between  the  knots  remains  constant,  and  one  can 
observe  the  strong  tendency  of  this  simpler  energy  measure 
to round the corners as the only way to minimize the total
curvature. 
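A small numerical check makes the effect of knot concentration concrete. This is only a sketch under simplifying assumptions: the curve is a polyline through the knots, the bending energy is approximated by the sum of squared second differences of the knot positions with respect to the knot index (a discrete stand-in for $d^2x/da^2$), and the normalization and stretching terms are ignored. The function names and knot spacings are hypothetical.

```python
import numpy as np

def bending_energy(knots):
    """Sum of squared second differences of knot positions w.r.t. the knot index."""
    knots = np.asarray(knots, dtype=float)
    second_diff = knots[2:] - 2.0 * knots[1:-1] + knots[:-2]
    return float(np.sum(second_diff ** 2))

def corner_knots(spacings):
    """Place knots on a right-angle corner: along the x axis to the origin, then up the y axis.

    `spacings` lists the knot-to-knot distances along one arm, ordered from
    the far end towards the corner; the second arm mirrors them.
    """
    spacings = np.asarray(spacings, dtype=float)
    # Distance of each knot from the corner, far end first, corner last.
    dist_to_corner = np.concatenate([np.cumsum(spacings[::-1])[::-1], [0.0]])
    first_arm = [(-d, 0.0) for d in dist_to_corner]            # ends at the corner
    second_arm = [(0.0, d) for d in dist_to_corner[::-1][1:]]  # leaves the corner
    return first_arm + second_arm

uniform = corner_knots([1.0, 1.0, 1.0, 1.0])         # equal spacing, arm length 4
concentrated = corner_knots([1.5, 1.25, 0.75, 0.5])  # same arm length, denser at the corner

print(bending_energy(uniform), bending_energy(concentrated))
```

Both knot sets lie exactly on the corner, yet the concentrated set has the lower discrete energy, so a minimizer working in the knot parameter has no incentive to round the corner.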

The  particular  representation  of  the  curve  also  has 
some  effect  on  how  well  it  can  follow  the  data.  In  our  B- 
spline  representation,  s  has  a  continuous  second  derivative 
with  respect  to  a,  and  this  limits  the  amount  by  which  the 
distance  between  the  knot  points  can  vary  along  the  curve. 
In the line segment representation this variation was more
pronounced and the density of knot points in the corner
was greater than in the present examples, so that representation
deserves a further look in spite of some instability problems.
A representation by Bessel or Akima cubic splines, which have
a continuous first derivative, also merits consideration.

Our  next  test  involves  interpolating  sparse  data  points 
indicating  some  arbitrarily  winding  curve.  The  curve  is 
neither  convex  nor  closed.  There  was  no  need  for  any  con¬ 
ditions  on  the  boundary.  Fig.  8a  shows  the  interpolating 
curve  with  20  knot  points  (which  do  not  coincide  with  the 
data  points).  In  Fig.  8b  the  resolution  of  the  curve  was  re¬ 
duced  to  10  points.  The  curve  is  smoother,  but  follows  the 
data  less  closely.  A  similar  effect  can  be  observed  in  Fig. 
8c;  this  time  the  curve’s  resolution  was  kept  high,  while  the 
effective  range  of  the  potential  g  was  increased.  That  has 
the  effect  of  smoothing  the  data  itself. 

Finally,  a  planar  curve  in  3-D  was  obtained  from  input 
consisting of two end-points with known coplanar tangents.
The  result  is  similar  to  Fig.  7c,  except  that  it  lies  in  the 
3-D  space. 




XIII. CONCLUSIONS

In  this  paper  we  have  presented  a  new  method  of  re¬ 
constructing  a  3-D  surface  from  given  contours.  Some  ad¬ 
vantages  of  this  mechanism  are  its  ability  to  handle  shapes 
containing  both  small-scale  and  large  scale  features,  and 
dimensionlessness.  These  are  among  the  requirements  that 
any  reconstruction  mechanism  will  have  to  meet.  The 
method  is  based  on  representing  the  surface  as  a  network 
of  springs,  with  each  (natural)  coordinate  curve  on  the  sur¬ 
face  being  analogous  to  one  spring.  The  springs  contribute 
energy  from  both  their  bending  and  stretching,  and  the  op¬ 
timal  surface  is  obtained  by  minimizing  the  energy  of  the 
system  over  all  possible  spring  networks  which  conform  with 
the  given  data. 

The springs’ “coils” represent knots of a varying mesh
that is used to overcome the problem of variations of different
scales in different parts of the curve. The mesh is finer,
or  the  coils  are  more  densely  concentrated,  in  parts  of  the 
curve  that  change  more  rapidly.  The  same  minimum  prin¬ 
ciple  determines  both  the  optimal  placement  of  the  knots 
and  the  shape  of  the  curve  (or  surface)  between  the  knots. 

We  have  examined  a  few  cases  that  can  be  handled  an¬ 
alytically,  mainly  because  of  their  symmetry.  A  surface  of 
revolution,  for  instance,  along  with  its  meridians  and  par¬ 
allels,  could  be  reconstructed  from  appropriate  boundary 
conditions.  These  cases  also  seem  consistent  with  human 
perception. 

We  have  implemented  the  theory  numerically  in  the 
2-D  case.  The  implementation  has  produced  results  on  sev¬ 
eral  levels.  At  the  higher  level  of  evaluating  the  theory  as 
a  whole,  we  have  a  proof  of  the  basic  principle  described 
above.  The  program  has  worked  well  in  this  respect.  As  we 
saw,  the  curve  has  no  problem  following  sharp  corners,  by 
placing  more  knot  points  there.  This  problem  of  different 
rates  of  change  of  a  curve,  such  as  the  difference  between 
a  corner  and  a  straight  line,  has  plagued  all  other  curve 
fitting  techniques  and  it  seems  that  our  method,  in  princi¬ 
ple,  solves  the  problem.  In  addition,  our  method  does  not 
impose  the  restrictions  that  many  other  methods  do.  In 
particular, the curve need not be convex, closed, or single
valued. There remain other problems that usually arise in
minimization  methods.  The  non-uniqueness  of  the  solution 
means  that  the  method  will  need  some  higher  level  guidance 
to  choose  the  “right”  solution.  Efficiency  is  another  issue, 
although  it  was  not  our  goal  here.  It  would  be  interesting 
to  consider  the  question  of  whether  a  multi-resolution  algo¬ 
rithm,  which  was  successful  in  accelerating  the  convergence 
of a linear, fixed grid minimization problem [Terzopoulos,
1984],  would  be  applicable  to  our  case. 

On  another  level,  the  implementation  has  improved 
our  understanding  of  the  theoretical  problem  and  led  to 
some  changes.  Among  them  were:  (i)  The  bending  term 
was  generalized  somewhat,  to  include  various  possibilities 
of  expressing  the  curvature  as  a  second  derivative  of  x  with 
respect to s and a. (ii) The treatment of the global factor L.
By  treating  it  as  an  independent  variable  to  be  determined 
by  the  Lagrange  multiplier  method  we  have  been  able  to 
keep  the  global  computations  to  a  minimum.  This  will  be 


important  in  an  implementation  on  a  parallel  machine. 

On the numerical level, we have learned about various
numerical aspects of the implementation and found suitable
ways of representing and discretizing the curve and the
data.  We  have  dealt  with  convergence  and  stability  prob¬ 
lems,  and  have  examined  the  effects  of  various  parameters 
in  the  theory.  All  this  has  great  value  in  implementing  the 
more  general  case  of  3-D  surface  interpolation. 

REFERENCES 

Barnhill,  R.E.  and  R.F.  Riesenfeld  [1974],  Computer 
Aided  Geometric  Design,  Academic  Press,  New  York. 

Barrow, H.G. and J.M. Tenenbaum [1981], “Interpreting
line drawings as three-dimensional surfaces”, Artif.
Intell., 17, 75-117.

Brady,  J.M.  and  H.  Asada  [1984],  “Smoothed  local  sym¬ 
metries  and  their  implementation”,  A.  I.  Memo  No. 
757. 

Brady,  J.M.,  J.  Ponce,  A.  Yuille,  and  H.  Asada 
[1984],  “Describing  surfaces”,  Proc.  2nd.  Symp.  Rob. 
Res.  (Kyoto). 

Brady, J.M. and A. Yuille [1984], “An extremum principle
for shape from contour”, IEEE Trans. Patt. Anal.
& Mach. Int., PAMI-6.

Courant,  R.  and  D.  Hilbert  [1953],  Methods  of  Mathe¬ 
matical  Physics,  Vol.  I,  Interscience,  London. 

Davis,  L.,  L.  Janos  and  S.  Dunn  [1982],  “Efficient  re¬ 
covery  of  shape  from  texture”,  Computer  Vision  Lab, 
University  of  Maryland,  TR-1133. 

DeBoor, C. [1978], A Practical Guide to Splines, Springer-Verlag,
New York.

Do  Carmo,  M.  [1976],  Differential  Geometry  of  Curves 
and  Surfaces,  Prentice-Hall,  Englewood  Cliffs,  NJ. 

Grimson, W.E.L. [1981], “The implicit constraint of the
primal  sketch”,  MIT  A.I.  Memo  No.  663. 

Horn, B.K.P. [1981], “The curve of least energy”, MIT
A.I. Memo No. 612a.

Marr,  D.  [1982],  Vision,  Freeman,  San  Francisco. 

Pavlidis,  T.  [1977],  Structural  Pattern  Recognition,  Springer- 
Verlag,  New  York. 

Rosenfeld,  A.  and  A.C.  Kak  [1982],  Digital  Picture  Pro¬ 
cessing,  2nd  edition,  Academic  Press,  New  York. 

Stevens,  K.A.  [1981],  “The  visual  interpretation  of  surface 
contours”, Artif. Intell., 17, 47-73.

Strang,  G.,  and  G.J.  Fix  [1973],  An  Analysis  of  the  Fi¬ 
nite  Element  Method,  Prentice-Hall,  Englewood  Cliffs, 
N.J. 




Terzopoulos, D. [1982], “Multilevel reconstruction of visual
surfaces: variational principles and finite element
representation.” MIT AI Memo 671. Reprinted
in Multiresolution Image Processing and Analysis, A.
Rosenfeld (ed.), Springer-Verlag, New York, 1984.

Terzopoulos, D. [1985], “Computing visible surface
representations”, MIT AI Memo No. 800.

Weiss,  I.  [1986a],  “3-D  Surface  Representation  by  Con¬ 
tours”,  Computer  Vision  Lab,  University  of  Maryland, 
CAR-TR-222. 

Weiss,  I.  [1986b],  “Curve  Fitting  with  Optimal  Mesh 
Point  Placement”,  Computer  Vision  Lab,  University 
of  Maryland,  CAR-TR-224. 

Witkin,  A.  [1981],  “Recovering  surface  shape  and  orien¬ 
tation  from  texture”,  Artif.  Intell.,  17,  17-45. 

Young, D. and R. Gregory [1973], A Survey of Numerical
Mathematics, vols. 1 and 2, Addison-Wesley,
Reading, MA.

Acknowledgement 

The  author  thanks  J.  Michael  Brady  of  MIT’s  AI  Lab 
(now  at  Oxford  University)  for  very  valuable  discussions, 
comments  and  suggestions. 




A  corner 


Fig.  3 

Ribbons  with  symmetry  curves 




STEREO MATCHING USING VITERBI ALGORITHM


G.V.S.  Raju* 

Dept,  of  Electrical  and 
Computer  Engineering 
Ohio  University 
Athens,  OH  45701 


Thomas O. Binford
AI  Laboratory 
Computer  Science  Dept. 
Stanford  University 
Stanford,  CA  94305 


S.  Shekhar 
AI  Laboratory 
Computer  Science  Dept. 
Stanford  University 
Stanford,  CA  94305 


ABSTRACT 

The  key  problem  in  stereo  is  a  search 
problem  which  finds  the  correspondence  between 
points  in  left  and  right  images,  so  that,  given 
the  camera  model,  the  depth  can  be  computed  from 
triangulation.  This  paper  presents  a  stereo 
matching  method  using  a  modified  Viterbi  algo¬ 
rithm.  Our  algorithm  matches  surfaces  between 
edges  (scanline  intervals  between  edgels  belong¬ 
ing  to  edges)  in  the  two  images.  Interval  ratio, 
edgel  orientation  angles  and  contrast  are  used  in 
developing  a  similarity  function  (cost  or  gain) 
to  match  edgels  (intervals)  in  the  two  images  on 
the same scanline. The major advantage of our
method is in the computation. The method has
been  tested  with  different  types  of  images  and 
the  results  are  presented  in  the  paper. 


INTRODUCTION 

In the past few years, we have seen a growing
interest in the application of 3D image
processing. With increasing demand for 3D spatial
information for tasks of passive navigation
[1, 2], automatic surveillance [3], aerial cartography
[4, 5], and inspection in industrial
automation,  the  importance  of  effective  stereo 
analysis  has  been  made  quite  clear.  A  particular 
challenge  in  this  area  is  to  provide  reliable  and 
accurate  depth  data  for  input  to  object  or  ter¬ 
rain  modeling  systems. 

Barnard and Fischler [6] viewed the problem
of  stereo  analysis  in  terms  of  image  acquisition, 
camera  modeling,  feature  acquisition,  image 
matching,  depth  determination  and  interpolation. 
The crucial step is image matching, which
identifies the corresponding points in two images
that are cast by the same physical point in 3D
space. This research work is concerned only with
this problem. The paper addresses the
problem of finding corresponding points between
the left and right images, so that, given
the camera model, the depth can be computed by
triangulation. Knowing the geometric relationships
between the cameras used in the imaging can
reduce image-to-image correspondence to a set of


*  On  sabbatical  leave  at  the  Artificial  Intel¬ 
ligence  Laboratory,  Computer  Science  Department, 
Stanford  University,  Stanford,  CA  94305. 


scanline-to-scanline  correspondence  problems. 

When  a  pair  of  stereo  images  is  rectified,  the 
epipolar  lines  are  horizontal  scanlines  (rows). 
The  epipolar  lines  are  intersections  of  the  two 
image  planes  with  an  epipolar  plane  which  passes 
through  an  object  point  and  the  two  camera  foci. 
Figure  1  shows  the  geometry  of  this  situation. 

In  edge-based  stereo,  edges  in  the  images  are 
used  as  the  elements  whose  correspondences  are  to 
be  found. 
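For a rectified pair, triangulation reduces to a one-line formula. The sketch below assumes the standard parallel-axis pinhole model, in which depth is Z = fB/d for focal length f (in pixels), baseline B, and disparity d = x_left - x_right. The function name and the numeric values are illustrative and do not come from this paper.

```python
def depth_from_disparity(x_left, x_right, focal_length_px, baseline):
    """Depth of a point matched at columns x_left and x_right on the same scanline."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("a point in front of both cameras must have positive disparity")
    # Similar triangles: Z / baseline = focal_length / disparity.
    return focal_length_px * baseline / disparity

# Hypothetical numbers: 800-pixel focal length, 0.12 m baseline, 16-pixel disparity.
depth = depth_from_disparity(210.0, 194.0, 800.0, 0.12)
print(depth)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points than for near ones, which is one reason accurate correspondence is the crucial step.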


Fig.  1  Collinear  Epipolar  Geometry 

A  pair  of  corresponding  edgels  (edge- 
elements  [7])  in  the  right  and  left  images  should 
be  searched  for  only  within  the  same  horizontal 
scan  lines.  The  search  within  the  same  horizon¬ 
tal  scanlines  can  be  treated  as  the  problem  of 
finding  a  matching  path  on  a  two-dimensional  (2D) 
search  plane  whose  vertical  and  horizontal  axes 
are the right and left scanlines. A modified
Viterbi algorithm (a version of the dynamic programming
technique) [8, 9] can perform this 2D
search efficiently.

However, if there is an edge (a curve composed
of edgels) extending across scanlines, the correspondences
in one scanline depend strongly
on the correspondences in the neighboring
scanlines, because if two edgels are on a non-horizontal
edge in the left image, their corresponding
edgels most likely lie on a corresponding
edge in the right image [10]. To account
for this, edge correspondence is sought
through the modified Viterbi algorithm.

The  various  stereo  matching  methods  differ 
in  the  primitives  used  for  matching,  similarity 
measures  used  for  matching,  and  the  techniques 
(algorithms)  used  for  local  and  global  matching. 


766 


In  this  paper  we  use  edgels  as  a  primitive  for 
local  matching  (scanline  to  scanline)  and  edges 
(curves  composed  of  edgels)  as  a  primitive  for 
global  matching.  Several  similarity  functions 
are  developed  which  are  used  to  estimate  the  cost 
of  edgel  (interval)  similarity.  The  stereo 
matching method developed in this paper also uses
a modified Viterbi algorithm, as in the work of Arnold
[8], Baker and Binford [9], and Ohta and
Kanade  [10],  to  determine  edgel  correspondence 
and  edge  correspondence.  The  major  advantage  of 
the  method  described  here  is  in  the  computation. 

First,  we  give  a  brief  review  of  previous 
work  in  stereo  matching,  then  we  describe  our 
method  and  the  implementation  details,  and 
finally  present  some  of  the  results  of  applying 
this  method  to  real  images  and  conclusions. 

REVIEW 

For  stereo  matching,  area-based  [11,  12]  and 
feature-based  [8,  9,  10,  13,  18]  techniques  have 
been  used.  In  area-based  analysis  two- 
dimensional  windowing  operators  measure  the 
similarity  in  intensity  pattern  between  local 
areas, or windows, in the two images. Cross-correlation
is used to match
windows in one image with windows in the other.
Area-based  correspondence  has  been  applied  quite 
successfully  to  the  stereo  analysis  of  rolling 
terrain,  but  it  degrades  when  the  scene  is  not 
smoothly  varying  and  continuous.  It  requires  the 
presence of detectable texture within each correlation
window, and consequently it tends to
fail in featureless or repetitive-texture environments
and at surface discontinuities. At surface
discontinuities  there  are  occlusions  and  no  pos¬ 
sible  correspondence  between  areas  crossing  the 
occlusion. 
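The area-based approach can be sketched as follows. This is a generic normalized cross-correlation search along one scanline, not the specific method of [11, 12]. The window size, search range, and synthetic test images are assumptions made for illustration.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized windows."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    # A featureless window has zero variance and carries no matching evidence.
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def best_disparity(left, right, row, col, half=2, max_disp=8):
    """Disparity whose right-image window correlates best with the left window at (row, col)."""
    window = left[row - half:row + half + 1, col - half:col + half + 1]
    scores = []
    for d in range(max_disp + 1):
        c = col - d  # a left-image point at column col appears at col - d on the right
        candidate = right[row - half:row + half + 1, c - half:c + half + 1]
        scores.append(ncc(window, candidate))
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
left = rng.random((20, 60))                     # synthetic textured image
true_disparity = 4
right = np.roll(left, -true_disparity, axis=1)  # same disparity on every scanline

estimate = best_disparity(left, right, row=10, col=30)
print(estimate)
```

The search succeeds here because every window is richly textured and the scene has a single depth. On a constant-intensity window `ncc` returns 0 for every candidate, which is the featureless failure case described above.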

Feature-based  systems  match  features  derived 
from  the  two  images  rather  than  the  intensity  ar¬ 
rays.  The  most  commonly  used  features  are  edges 
that  are  extracted  from  the  intensity  image  for 
matching.  The  underlying  principle  is  that  a 
discontinuity  in  the  intensity  represents  a  dis¬ 
continuity  on  the  physical  surfaces  in  the  scene. 
The  discontinuity  can  be  due  to  surface  depth, 
orientation,  reflectance,  or  illumination.  By 
matching  these  features  we  match  physical  curves 
on  object  surfaces. 

The  feature-based  systems  are  less  sensitive 
to  photometric  variations  because  they  represent 
geometric  properties  of  a  scene  and  are  faster 
since the number of features to consider is
smaller than the number of pixels. Also, the match
is  more  accurate  as  the  features  (edges)  in  an 
image  can  be  located  with  subpixel  accuracy. 

Since  the  feature  matching  leads  only  to  a  sparse 
depth map, we need to reconstruct a dense depth map
through  surface  interpolation,  area  matching,  or 
model-based  interpretation. 
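The simplest form of the interpolation step fills in disparities between matched edgels on a scanline. The sketch below uses plain linear interpolation via `numpy.interp` and hypothetical edgel positions. It is only an illustration, since linear interpolation smooths across depth discontinuities, which a real surface interpolation scheme must treat separately.

```python
import numpy as np

def densify_scanline(edgel_columns, edgel_disparities, width):
    """Linearly interpolate sparse edgel disparities to every column of a scanline."""
    columns = np.arange(width)
    # np.interp expects the known sample positions in increasing order.
    return np.interp(columns, edgel_columns, edgel_disparities)

# Hypothetical matched edgels at columns 0, 4, and 8 with their disparities.
dense = densify_scanline([0, 4, 8], [2.0, 4.0, 8.0], width=9)
print(dense)
```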

The most widely known edge-based stereo system
is that of Marr and Poggio [14], as implemented
by Grimson. It uses a uniqueness constraint
to assign at most one disparity value to
each point in the image and a continuity requirement
such that disparity values change smoothly,
except at depth discontinuities.

Medioni  and  Nevatia  [13]  used  segments, 
groups  of  collinear  connected  edge  points,  as 
matching  primitives.  In  their  study,  the  cor¬ 
respondence  is  based  on  a  minimum  differential 
disparity  criterion.  Arnold  and  Binford  [15] 
developed  an  edge-based  stereo  correspondence 
system that used local edgel properties to select
edgel match possibilities, and a weighted iteration
process to resolve match conflicts.

Baker  [16],  Arnold  [8],  Ohta  and  Kanade 
[10], and Baker and Binford [9] have used the dynamic
programming technique (Viterbi algorithm) in
stereo  matching.  Baker  first  processed  each  pair 
of  scanlines  independently.  After  all  the  scan¬ 
line  matching  was  done,  he  used  a  cooperative 
process  to  detect  and  correct  the  matching 
results  which  violate  the  consistency  con¬ 
straints.  Since  this  method  does  not  use  the 
across-scan-line  constraints  directly  in  the 
search,  the  result  from  the  cooperative  process 
is  not  guaranteed  to  be  optimal.  Ohta  and  Kanade 
applied  dynamic  programming  for  scanline  search 
and across-scanline search. The problem of obtaining
correspondence between edgels on right
and left epipolar scanlines was solved as a path
finding problem on a 2D plane. The problem of
obtaining a correspondence between edges under
the across-scanline consistency constraint was viewed
as a problem of finding a set of paths in a 3D
space which is a stack of 2D planes for across-scanline
search. The optimal path on the 2D
plane was obtained by iterating the selection of
an optimal path at
each 2D node (intersection of left edge and right
edge). Similarly, the optimal set of paths in 3D
space was obtained by iterating the selection of
an optimal set of paths at each 3D node. The
number of computations involved is estimated to
be $O(T\,U^2 V^2 M^2 N^2)$, where T stands for the number
of scanlines in the images, U and V for the
number of edges in the right and left images, and
M and N for the average numbers of edgel points
in one scanline in the right and left images,
respectively. This is a tremendous number of
computations for the correspondence search.

In  the  Ohta  and  Kanade  study  the  computation 
of cost in the search algorithm was based on the

cost  of  the  primitive  path  on  a  2D  search  plane. 
The  cost  of  a  2D  primitive  path  was  defined  as 
the  similarity  between  intervals  delimited  by 
edgels  in  the  right  and  left  images  on  the  same 
scanline.  The  similarity  was  measured  in  terms 
of the variance of the intensity values of the
pixels  which  comprise  the  two  intervals. 

MODIFIED  VITERBI  ALGORITHM  [8] 

Dynamic  programming  is  a  technique  useful 
for  matching  two  sequences.  It  solves  an  N-stage 
decision  process  as  N  single-stage  processes. 

This  reduces  the  computational  complexity  to  the 
logarithm  of  the  original  combinatorial  one.  The 




Viterbi algorithm is a dynamic programming algorithm
which finds a best match from among all the
allowable  matches.  The  sequences  to  be  matched 
define  the  two  dimensions  of  a  matrix  such  that 
each  entry  in  the  matrix  represents  the  cost  of 
matching  that  pair  of  elements.  A  path  will  con¬ 
sist  of  a  sequence  of  nodes,  each  of  which  cor¬ 
responds  to  one  entry  in  the  matrix.  The  goal  is 
to  find  an  optimal  path  through  the  cost  matrix 
such that the sum of the costs along the path is
a  maximum.  To  determine  the  optimal  path,  two 
constraints  are  used:  (1)  The  sequences  must  be 
matched  monotonically  and  (2)  each  element  of  a 
sequence  is  used  at  most  once.  These  constraints 
are  equivalent  to  assuming  that  the  path  must 
start  in  the  lower  left  corner  and  end  in  the  up¬ 
per  right.  From  each  node,  a  transition  may  only 
be  made  one  unit  vertically,  horizontally,  or 
diagonally  [8]. 

In  the  case  of  edgel  matching,  the  sequences 
to  be  matched  are  edgels  from  left  scanline  and 
from  the  corresponding  right  (epipolar  line) 
scanline  for  2D  match.  In  the  case  of  edge 
matching,  the  sequences  are  edges  from  left  and 
right  images.  It  is  possible  that  certain 
sequence  elements  may  have  no  match  in  the  other 
sequence  due  to  occlusion  in  the  images.  For  un¬ 
matched  elements  zero  cost  is  assigned.  Vertical 
or  horizontal  transitions  of  one  unit  indicate 
occlusion  of  the  element  whose  row  or  column  is 
being  entered.  Diagonal  transitions  of  one  unit 
indicate  a  normal  match  associated  with  elements 
belonging  to  a  newly  entered  row  and  column. 

The Viterbi algorithm proceeds by constructing
a second matrix, of the same dimensions as
the first, in which each entry holds the accumulated cost
of the optimum path from the starting node to the current
node. The matrix values are filled in ascending
order,  left  to  right  and  bottom  to  top,  beginning 
at  the  lower  left.  The  transition  rules 
guarantee  that  when  it  comes  time  to  fill  an 
entry  its  three  predecessors  will  already  have 
been assigned values. The algorithm simply examines
three candidates: the accumulated cost of the horizontal
predecessor, the accumulated cost of the vertical
predecessor, and the sum of the diagonal
predecessor's accumulated cost and the cost
for the current position from the first matrix.
It then selects the maximum. This maximum becomes
the current entry value in the second matrix,
and a pointer is stored to indicate which
predecessor  was  selected.  After  the  last  posi¬ 
tion  has  been  filled,  the  stored  pointers  are 
followed  backward  to  the  starting  node,  tracking 
out  the  optimum  path  from  the  upper  right  to  the 
lower  left.  It  is  possible  that  the  first  or 
last  few  elements  of  a  sequence  are  unmatched. 

This  corresponds  to  allowing  paths  to  begin  at 
any  point  in  the  first  row  or  column  and  end  at 
any point in the last row or column. To achieve
this, a zero row at the bottom and a zero column
at the left are added to the first matrix. Then,
the second matrix is constructed as before to obtain
the optimal path which gives the best match.
This procedure gives the same kind of results as
Baker [16]. Fig. 2 illustrates the application
of the modified Viterbi algorithm to a simple
example. 
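The procedure described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the gain matrix is hypothetical, indices are 0-based, and ties are broken in favour of leaving an element unmatched (a choice made here so that zero-gain pairs are not reported as matches).

```python
import numpy as np

def viterbi_match(cost):
    """Best monotonic matching of two sequences under a gain matrix.

    cost[i][j] is the gain of matching element i of the first sequence with
    element j of the second. Returns (total gain, list of matched (i, j)).
    """
    cost = np.asarray(cost, dtype=float)
    rows, cols = cost.shape
    # Second matrix, padded with a zero row and a zero column so that the
    # first elements of either sequence may stay unmatched.
    acc = np.zeros((rows + 1, cols + 1))
    move = np.zeros((rows + 1, cols + 1), dtype=int)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            candidates = (
                acc[i - 1, j],                           # 0: element i unmatched
                acc[i, j - 1],                           # 1: element j unmatched
                acc[i - 1, j - 1] + cost[i - 1, j - 1],  # 2: match i with j
            )
            move[i, j] = int(np.argmax(candidates))      # first maximum wins
            acc[i, j] = candidates[move[i, j]]
    # Follow the stored pointers back from the last entry.
    matches = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if move[i, j] == 2:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 0:
            i -= 1
        else:
            j -= 1
    return float(acc[rows, cols]), matches[::-1]

# Hypothetical gain matrix: 4 elements in one sequence, 3 in the other.
gains = [[0.9, 0.1, 0.0],
         [0.0, 0.2, 0.1],
         [0.1, 0.8, 0.0],
         [0.0, 0.1, 0.7]]
total, matches = viterbi_match(gains)
print(total, matches)
```

In this example the second element of the first sequence is left unmatched, which corresponds to an occluded element. Filling the second matrix visits each entry once and compares three candidates, so the work grows with the product of the two sequence lengths.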


Example  1 


[Figure: left and right image scanlines; label: interval associated with edgel 3]

(First) Edgel-Cost Matrix

  Left edgel |     Right edgel index
  index      |    1     2     3     4
  -----------+------------------------
      5      |    0     0    .1    .3
      4      |    0    .2     0    .8
      3      |   .1     0    .4     0
      2      |   .2    .7     0    .1
      1      |   .1     0     0

[Figure: Second Matrix Where Optimal Path and Matched Edgels Are Shown]

Fig. 2 Edgel Matching

EDGEL  CORRESPONDENCE  SEARCH  ON  SCANLINES 

The correspondences between edgels
(intervals) on the right and left epipolar
lines (horizontal scanlines) can be found on a
2D plane using the Viterbi algorithm described in
the earlier section. The edgels on each scanline
are indexed in increasing order from left to
right. Both ends of a scanline are treated as
edgels. The cost of matching each pair of edgels
(2D node) in the left and right scanlines is
evaluated using a similarity function which is
described in a later section. Using these edgel
costs as elements, an edgel-cost matrix (first
matrix) is created. The Viterbi algorithm is
used to construct a second matrix which gives the
optimal path and best edgel match. This is explained
using Example 1. Unlike in references
[8, 9, 10], in this study the edgel correspondence
search is used only to match edgels between edges
on each scanline.


EDGE  CORRESPONDENCE  SEARCH  ON  THE  TWO  IMAGES 

The  edgel  (interval)  correspondences  on 
scanlines  alone  cause  ambiguity.  To  resolve  this 




ambiguity and to obtain global correspondences,
edge (connected edgels across scanlines) correspondences
are used. The problem of correspondence
between edges (intervals between
edges) in the two images can be viewed as an extension
of the scanline correspondence search.

For illustrative purposes, let us consider
the right and left images in Fig. 3. The images
consist of three scanlines. There are five intervals
between edges (including right end; left end
is treated as beginning) in the left image and
four in the right image. (Even though the edges
are shown as vertical lines, in reality, they are
curves.)


[Figure: Left Image, with intervals, scanlines, and edges labeled]


Using  these  edge-costs  (3D  node  costs)  as 
elements,  an  edge-cost  matrix  (first  matrix)  is 
computed. The Viterbi algorithm is used to construct
a second matrix which gives the optimal
path  and  best  edge  match. 

METHOD 

In  Fig.  3,  edges  belonging  to  left  image  and 
right  image  are  shown.  Even  though  the  method 
matches  edges  in  the  two  images  to  get  depth  in¬ 
formation,  in  reality  it  is  equivalent  to  match¬ 
ing  surfaces  between  edges  in  the  two  images. 

The  method  involves  the  following  steps  and 
computations: 

(1)  The  edgels  are  obtained  by  applying  the 
operator described in [7] to a pair of stereo
images.  The  operator  produces  edgels  by  fitting 
a  directional  tanh-surface  to  windows  in  the 
images.  These  edgels  are  then  linked. 

(2)  Edges  are  formed  by  an  edge-linking 
process  such  that  two  edges  in  an  image  do  not 
intersect. 

(3)  Edges  are  ordered  and  labeled  in  both 
images. 

(4)  The  edgel  (interval)  similarity  cost  is 
computed  from  the  formula: 

$$\text{cost edgel}_{i,j}(t) = k \times (\text{interval ratio similarity function}) \times (\text{edgel orientation similarity function}) \times (\text{edgel contrast similarity function})$$

for all i, j, and t


Right  Image 

The  number  in  the  circle  next  to  an  edge 
is  its  index. 


where 

i = index of edgel in the right image
j = index of edgel in the left image
t = scanline index
k = constant


Fig. 3 Images Showing Edges, Intervals and Scanlines.

A  pair  of  edges  in  the  left  and  right  images 
make  a  set  of  2D  nodes  when  they  share  scanline 
pairs.  This  set  of  2D  nodes  is  referred  to  as  a 
single  3D  node.  The  cost  at  a  3D  node  is  the  sum 
of  the  costs  of  2D  nodes  belonging  to  that  3D 
node. 

For example, the cost of matching edge 2 in
the right image with edge 1 in the left image (3D
node 21) is the sum of the costs of matching interval
fg with f'g' (2D node 21) on scanline 2
and mn with n'o' (2D node 21) on scanline 3.
Similarly, the cost of matching edge 2 in the
right image with edge 2 in the left image (3D
node 22) is the sum of the costs of matching ab
with a'b' on scanline 1, fg with g'h' on scanline
2, and mn with o'p' on scanline 3.


All  the  above  four  steps  are  explained  in 
detail  in  the  implementation  section  later. 

(5) The surface (interval between neighbor
edges) similarity cost (edge cost) is given by

$$\mathrm{cost}_{u,v}(s) = \sum_{t} \text{cost edgel}_{i,j}(t) = \sum_{t} C_{u,v}(t)$$

where u and v are edge indices in the right and
left images, respectively. The summation is over
the scanlines which are common to edge u in the
right image and edge v in the left image. This
forms one element (u,v) in the edge cost matrix,
which has dimension U x V, where U is the number
of edges in the right image and V is the number
of edges in the left image. The edgel i
belongs to edge u and edgel j belongs to edge v.
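The summation in step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `edge_cost_matrix`, the dictionary layout, and the sample costs are hypothetical and are not taken from the authors' Common LISP implementation.

```python
from collections import defaultdict


def edge_cost_matrix(scanline_costs):
    """Sum 2D-node costs over scanlines into 3D-node (edge pair) costs.

    scanline_costs holds one dict per scanline. Each dict maps an edge
    pair (u, v) to the edgel (interval) cost found on that scanline,
    where u is the edge index in the right image and v the edge index
    in the left image.
    """
    totals = defaultdict(float)
    for costs in scanline_costs:
        for (u, v), cost in costs.items():
            totals[(u, v)] += cost
    return dict(totals)


# Hypothetical costs on three scanlines for 3D nodes 21 and 22.
scanlines = [
    {(2, 2): 0.8},
    {(2, 1): 0.3, (2, 2): 0.7},
    {(2, 1): 0.3, (2, 2): 0.1},
]
edge_costs = edge_cost_matrix(scanlines)
```

An edge pair that shares no scanline simply never appears in the result, which matches the restriction of the sum to common scanlines.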

(6) After determining the edge cost matrix
(first matrix) from step 5, it is optimised using
the Viterbi algorithm to obtain a second matrix as
shown below:

$$\mathrm{cost}_{o,u,v}(s) = \max\{\, \mathrm{cost}_{o,u-1,v-1}(s) + \mathrm{cost}_{u,v}(s),\ \mathrm{cost}_{o,u,v-1}(s),\ \mathrm{cost}_{o,u-1,v}(s) \,\}$$

$$\mathrm{cost}_{o,0,0}(s) = \mathrm{cost}_{o,u,0}(s) = \mathrm{cost}_{o,0,v}(s) = 0$$

$$1 \le u \le U, \qquad 1 \le v \le V$$

From the three edgel cost matrices in Fig. 4,
each element in the edge cost matrix is computed
by summing the costs belonging to the same 3D
node. For example, 3D node 22 cost is computed
by summing three values belonging to 3D node 22
on three scanline edgel cost matrices. The value
is .8 + .7 + .1 = 1.6. The edge cost (first)
matrix and second matrix are given in Fig. 5.


The second matrix gives the optimal path and
best match of edges. The optimisation requires
3UV computations. Example 2 illustrates the
steps involved in the method.
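The optimisation of step 6 can be sketched as a small dynamic program. The sketch below is an assumption-laden illustration, not the authors' code: the function name `match_edges`, the backtracking rule (prefer the diagonal move when it explains the accumulated cost), and the use of Python are all choices made here. The input values are the edge cost (first) matrix of Fig. 5 as read from the scan.

```python
import math


def match_edges(cost):
    """Order-preserving edge matching that maximises the summed cost.

    cost[u][v] is the edge cost for right edge u+1 and left edge v+1.
    Returns the optimal accumulated cost and the matched (u, v) pairs
    using 1-based edge indices.
    """
    U, V = len(cost), len(cost[0])
    # acc plays the role of the second matrix; row 0 and column 0 are zero.
    acc = [[0.0] * (V + 1) for _ in range(U + 1)]
    for u in range(1, U + 1):
        for v in range(1, V + 1):
            acc[u][v] = max(acc[u - 1][v - 1] + cost[u - 1][v - 1],
                            acc[u][v - 1],
                            acc[u - 1][v])
    # Backtrack from the top-right corner to recover the path.
    matches = []
    u, v = U, V
    while u > 0 and v > 0:
        diagonal = acc[u - 1][v - 1] + cost[u - 1][v - 1]
        if math.isclose(acc[u][v], diagonal):
            matches.append((u, v))
            u, v = u - 1, v - 1
        elif math.isclose(acc[u][v], acc[u][v - 1]):
            v -= 1
        else:
            u -= 1
    matches.reverse()
    return acc[U][V], matches


# Edge cost matrix from Fig. 5, indexed as first_matrix[u-1][v-1].
first_matrix = [
    [1.7, 1.0, 0.1, 0.3, 0.2],
    [1.6, 1.6, 0.5, 0.6, 0.3],
    [1.0, 1.0, 0.5, 1.4, 0.6],
    [0.7, 0.7, 0.5, 0.6, 1.3],
]
total, matches = match_edges(first_matrix)
```

Each cell needs three candidate values, which is where the 3UV operation count comes from. For this input the recovered path is (1,1), (2,2), (3,4), (4,5), the matches reported for Fig. 5.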

EXAMPLE  2 

For  Fig.  3,  the  edgel  cost  matrices  are 
given  for  each  of  the  scanlines.  On  scanline  1, 
there  are  3  edgels  (intervals)  belonging  to  edges 
2,  3  and  4,  respectively,  in  the  right  image. 
Similarly,  there  are  4  edgels  (intervals)  on 
scanline  1  in  the  left  image.  Therefore  the 
edgel  cost  matrix  for  scanline  1  is  3  x  4.  For 
scanlines  2  and  3,  the  edgel  cost  matrices  are 
respectively  4x5  and  4x4.  These  matrices  are 
given  in  Fig.  4. 


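Each entry below gives the edgel cost followed, in parentheses, by the 3D node number. Entries shown as [illegible] cannot be read in the scan.

Scanline 1 Edgel Cost Matrix

Left edgel
index        1          2          3
  4       .1 (25)    .2 (35)    .4 (45)
  3       .3 (24)    .7 (34)    .1 (44)
  2       .4 (23)    .3 (33)    .2 (43)
  1       .8 (22)    .6 (32)    .3 (42)
          Right edgel indices

Scanline 2 Edgel Cost Matrix

Left edgel
index        1          2          3          4
  5       .1 (15)    .1 (25)    .2 (35)    [illegible] (45)
  4       .1 (14)    .2 (24)    .2 (34)    .2 (44)
  3       .1 (13)    .1 (23)    .2 (33)    .3 (43)
  2       .6 (12)    .7 (22)    .1 (32)    .2 (42)
  1       .8 (11)    .3 (21)    .2 (31)    .1 (41)
          Right edgel indices

Scanline 3 Edgel Cost Matrix

Left edgel
index        1          2          3          4
  4       .1 (15)    .1 (25)    .2 (35)    .6 (45)
  3       .2 (14)    .1 (24)    .5 (34)    .3 (44)
  2       .4 (12)    .1 (22)    .3 (32)    .2 (42)
  1       .9 (11)    .3 (21)    .1 (31)    [illegible] (41)
          Right edgel indices

Note: Double digit number in each element indicates
indices of edges in the right and left images,
i.e., 3D node numbers.

Fig. 4 Edgel Cost Matrices for the images in
Fig. 3.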


Edge Cost (First) Matrix

Left edges
    v        u = 1     2       3       4
    5         .2      .3      .6      1.3
    4         .3      .6      1.4     .6
    3         .1      .5      .5      .5
    2         1.0     1.6     1.0     .7
    1         1.7     1.6     1.0     .7
              Right edges u

Second Matrix

Left edges
    v        u = 0     1       2       3       4
    5         0*      1.7     3.3*    4.7     6.0*
    4         0       1.7     3.3     4.7*    4.7
    3         0       1.7     3.3     3.8     3.8
    2         0       1.7     3.3     3.3*    3.3*
    1         0       1.7*    1.7     1.7     1.7
    0         0       0       0       0       0
              Right edges

Entries marked * are illegible in the scan and are restored from the recurrence of step 6.

The path on the second matrix gives the best
match. Matches are: (1,1), (2,2), (3,4), (4,5)

Fig. 5 Edge Costs and Optimal Edge Matches for
the Images in Fig. 3.


MODIFICATION 

In the above discussion we considered edges
that are longer than a threshold and run across
several scanlines. If one desires to include
edgels that do not belong to edges in the
matching, the previous method needs modification.
To study this, let us consider the right and left
images shown in Fig. 6. This is similar to Fig. 3
except that edgels on individual scanlines are
shown. The edgels lie between longer edges whose
correspondence is determined by the method
described previously. We want to determine the
correspondence of these edgels (intervals) on
right and left scanlines under the constraint of
matching edges in the right and left images. The
edge matching acts as a true (hard) constraint.

Two approaches are suggested below to modify
the method described earlier to accommodate edgel
matching between edges.


[Fig. 6: right image panel showing edgels, edges, intervals, and scanlines.]

The number in the circle next to the edge
is its index. On each scanline, edgels
are labeled in increasing order from left
to right.

Fig. 6 Images Showing Edgels, Edges, Intervals
and Scanlines.


[Fig. 7: edgel cost array for scanline 2 in Example 2, with left edgel indices on the vertical axis and right edgel indices on the horizontal axis. The 2D nodes are grouped into subarrays by 3D node (for example, the 2D nodes belonging to 3D node 45). A 2D search in each subarray gives the optimal value for that 3D node: C011 for node 11, C012 for node 12, C022 for node 22.]

Fig. 7 Illustration of the Calculation of the Optimal
Cost of Matching Edgels between Edges.


Approach 1:

This approach involves modification of step
5 in the method. Suppose that edge 2 in the
right image matched with edge 2 in the left
image; then the edgels in ab need to match with
edgels in a'b' on scanline 1, edgels in fg
need to match with edgels in g'h' on scanline 2,
and edgels in mn need to match with edgels in
o'p' on scanline 3. Therefore, we have multiple
edgels (intervals), instead of a single edgel
(interval), between edges on each scanline for
matching.

For each scanline, we use 2D search to match
edgels that are between edges. The optimal cost
of matching edgels between edges is used to
obtain edge cost as below:

$$\mathrm{cost}_{u,v}(s) = \sum_{t} C_{o,u,v}(t)$$

The summation is over common scanline pairs
(t is the scanline index).

The calculation of optimal cost for matching
edgels between edges on scanline 2 is shown in
Fig. 7.

The additional computation for this
modification in the method is

3TMN

where

T = number of scanlines in the images

M = average number of edgels in one scanline in
the right image

N = average number of edgels in one scanline in
the left image

The total number of optimization
computations = 3TMN + 3UV.

The estimate is obtained as follows: each
scanline requires 3MN computations, there are T
scanlines, and edge matching requires 3UV
computations.
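Approach 1 can be sketched as follows. The text does not spell out the 2D search, so this sketch assumes it uses the same three-way recurrence as the edge-level optimisation. The names `best_path_cost` and `edge_cost_with_edgels` and the sample subarrays are hypothetical.

```python
def best_path_cost(sub):
    """Optimal accumulated cost of an order-preserving matching inside
    one subarray of edgel costs (one scanline, one candidate edge pair).

    Assumption: the 2D search follows the same recurrence as step 6.
    """
    rows, cols = len(sub), len(sub[0])
    acc = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            acc[i][j] = max(acc[i - 1][j - 1] + sub[i - 1][j - 1],
                            acc[i][j - 1],
                            acc[i - 1][j])
    return acc[rows][cols]


def edge_cost_with_edgels(subarrays):
    """Edge cost for one (u, v) pair: the optimal subarray costs summed
    over the scanlines common to both edges."""
    return sum(best_path_cost(sub) for sub in subarrays)


# Hypothetical subarrays for one edge pair on two common scanlines.
subarrays = [
    [[0.8]],                    # one edgel on each side
    [[0.5, 0.1], [0.2, 0.6]],   # two edgels on each side
]
cost_uv = edge_cost_with_edgels(subarrays)
```

The resulting value takes the place of the single-interval cost in the edge cost matrix before the edge-level optimisation is run.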

Approach  2 : 

We first obtain edge matches using the six
steps in the method and an appropriate threshold.
Since the edge matches act as constraints, we can
match the edgels in between matched edges by
applying the 2D search only in the subarrays which
belong to the matched 3D nodes (edges).

The total number of computations is much
less than in approach 1 because the 2D search is
applied only in the subarrays belonging to
matched 3D nodes (≤ Min(U,V)).

IMPLEMENTATION 

The stereo matching method described in this
paper is implemented on a Symbolics 3600 machine
using Common LISP. The implementation details of
the first four steps of the method are described
below:


EDGEL  DETECTION  AND  LINKING 

We detect edgels by applying the operator
described in [7]. An edge in an image corresponds
to a discontinuity in the intensity surface of
the underlying scene. It can be approximated by
a piecewise, straight curve composed of edgels,
i.e., short, linear edge-elements, each
characterized by a direction and a position. The
approach to edgel detection is to fit a series of
one-dimensional surfaces to each window. The
operator in [7] produces a list of edgels by
fitting a directional tanh-surface to windows in
the images. These edgels are then linked and fitted
with conic sections and straight lines. The
edgel detection is robust and has subpixel
position accuracy and an angular localization
better than 10°. Edges are formed by an edge-linking
process such that two edges in an image do not
intersect, and an edge does not cross a scanline
more than once.

EDGE  ORDERING 

The edges are ordered by following a
procedure which is similar to the one in [10]. The
edges which run across the same scanline are
locally ordered from left to right. This is done
independently on each scanline. Then a graph
representing this local order is generated with
nodes as edges and directed arcs showing the
local ordering between them. For each node, the
maximum number of arcs from the leftmost node to
that node is calculated. The edges are ordered
by the increasing order of their maximum number
of arcs. If the edges have the same maximum
number of arcs, then the edge which starts earlier
is given the smaller index. This assignment
scheme is neither optimal nor unique.
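The ordering procedure can be sketched as a longest-path computation on the graph of local orderings. This is an illustration only. The function name `order_edges`, the input layout, and the decision to measure the maximum number of arcs from any node with no predecessor are assumptions made here. The sketch also assumes the graph is acyclic, which the non-intersecting edge constraint is intended to ensure.

```python
from collections import defaultdict


def order_edges(scanline_orders, start_scanline):
    """Assign an ordering index to every edge.

    scanline_orders: one list per scanline holding the ids of the edges
        crossing that scanline, from left to right.
    start_scanline: maps an edge id to the first scanline it appears on.
    Returns a dict mapping edge id to a 1-based ordering index.
    """
    pred = defaultdict(set)
    edges = set()
    for order in scanline_orders:
        edges.update(order)
        # A directed arc from each edge to its right-hand neighbour.
        for left, right in zip(order, order[1:]):
            pred[right].add(left)

    depth = {}

    def max_arcs(edge):
        # Longest chain of arcs ending at this edge (0 for a source node).
        if edge not in depth:
            depth[edge] = 1 + max((max_arcs(p) for p in pred[edge]),
                                  default=-1)
        return depth[edge]

    ranked = sorted(edges, key=lambda e: (max_arcs(e), start_scanline[e]))
    return {edge: index + 1 for index, edge in enumerate(ranked)}


# Hypothetical example with four edges on three scanlines.
orders = [["p", "r"], ["p", "q", "r"], ["q", "s"]]
starts = {"p": 1, "r": 1, "q": 2, "s": 3}
indices = order_edges(orders, starts)
```

Here r and s both have a maximum of two arcs, so the tie is broken by the starting scanline and r receives the smaller index. Ties that survive both keys are resolved arbitrarily, which is one way the assignment ends up non-unique.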

EDGEL  (INTERVAL)  SIMILARITY  COST 

Step 4 of the method is the calculation
of the edgel (interval) similarity cost. The
similarity cost (cost edgel_{i,j}(t)) is based on the
interval ratio similarity function, the edgel
orientation similarity function, and the edgel
contrast similarity function. We took the similarity
cost as proportional to the product of all three
similarity functions. In computing the cost, we
look for interval ratio similarity, orientation
similarity, and contrast similarity.

The interval ratio and edgel orientation
measures have been used by Arnold [8] and Baker
[9] in the cost evaluation function. They
treated them as probabilities and evaluated the
edgel cost.

In this paper, specific functions have been
defined for these three measures, and the
following paragraphs explain these functions.

For an object surface, its image at a
particular epipolar line will generally consist of
two boundary edges and the interval between them.
If the object surface is visible to both cameras,
there will be a corresponding interval in each
image. The lengths of these intervals are
related to the angle of the surface and to the
camera geometry. For small or moderate
baselines, the lengths are comparable.

[Fig. 8: diagram of the camera geometry with the baseline marked.]

Fig. 8 Calculation of Interval Ratio Similarity
Function.


Arnold [8] has shown that the probability
density of the correspondence of the two
intervals Ll and Lr in the two images is maximum when
Ll = Lr.

We have chosen the interval ratio similarity
function fi to be

$$f_i = e^{-10(1-r)} \quad \text{for } B/z = .07$$

$$f_i = r \quad \text{for } B/z = .2$$

where

$$r = \frac{\mathrm{Minimum}(L_l, L_r)}{\mathrm{Maximum}(L_l, L_r)}$$

For a corresponding pair of edgels, one in
the left image and one in the right image, we are
interested in the similarity of their angles. We
adopted the following similarity function, fo,
for edgel orientation:

(1) if B/z = .07

$$f_o = 1 - \frac{|\Delta\theta|}{3} \quad \text{for } |\Delta\theta| < 3$$

$$f_o = 0 \quad \text{for } |\Delta\theta| \ge 3$$

(2) if B/z = .2

$$f_o = 1 - \frac{|\Delta\theta|}{15} \quad \text{for } |\Delta\theta| < 15$$

$$f_o = 0 \quad \text{for } |\Delta\theta| \ge 15$$

where

$$\Delta\theta = \theta_l - \theta_r$$

θl = edgel orientation with respect to epipolar
line in the left image

θr = edgel orientation with respect to epipolar
line in the right image

θr > 10°

and

θl > 10°.
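The similarity functions of this section, and their product as the edgel cost of step 4, can be sketched as below. Several points are assumptions made for illustration: the function names, the boolean `narrow_baseline` flag (true for B/z = .07, false for B/z = .2), the treatment of angles as degrees, and the representation of edgel contrast as a value compared for equality (1.0 when the two contrasts agree, 0 otherwise).

```python
import math


def interval_ratio_similarity(len_left, len_right, narrow_baseline):
    """f_i: compares the interval lengths L_l and L_r."""
    r = min(len_left, len_right) / max(len_left, len_right)
    return math.exp(-10.0 * (1.0 - r)) if narrow_baseline else r


def orientation_similarity(theta_left, theta_right, narrow_baseline):
    """f_o: linear fall-off with the orientation difference, reaching
    zero at 3 (narrow baseline) or 15 (wide baseline)."""
    limit = 3.0 if narrow_baseline else 15.0
    delta = abs(theta_left - theta_right)
    return 1.0 - delta / limit if delta < limit else 0.0


def contrast_similarity(contrast_left, contrast_right):
    """f_c: 1.0 when the two contrasts agree, 0.0 otherwise."""
    return 1.0 if contrast_left == contrast_right else 0.0


def edgel_cost(len_left, len_right, theta_left, theta_right,
               contrast_left, contrast_right, narrow_baseline, k=1.0):
    """Edgel (interval) similarity cost: k times the product of the
    three similarity functions."""
    return (k
            * interval_ratio_similarity(len_left, len_right, narrow_baseline)
            * orientation_similarity(theta_left, theta_right, narrow_baseline)
            * contrast_similarity(contrast_left, contrast_right))


# Hypothetical edgel pair: equal intervals, orientations 1.5 apart.
example = edgel_cost(10.0, 10.0, 40.0, 41.5, +1, +1, narrow_baseline=True)
```

Because the three factors are multiplied, a zero in any one of them (for example, opposite contrast) vetoes the match regardless of the other two.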

The edgel contrast similarity function, fc, is
given by

fc = 1.0 if contrast of edgel in the left image =
contrast of edgel in the right image

fc = 0 otherwise

RESULTS

This paper addresses mainly the identification
of corresponding points in the two images.
Since our method generates only a sparse
disparity map, it is difficult to display the
results in a very convenient visual form. Proper
display requires the construction of an
interpolating surface [17], and good interpolation
requires object segmentation first. This is not
done here because it is a large problem by itself.

The matched segments in the left and the
right images are displayed separately. The
information about which segments (edges) in one
image match with which in the other image is
provided.

STANFORD UNIVERSITY BLOCKS IMAGE

We first applied our method to this image of
some blocks, cylinders and spheres, as shown in
Fig. 9. The image size is 256 x 256 pixels and the
intensity resolution is 8 bits. Fig. 10 is the
result from edge detection, linking and
thresholding (on the length of edges: pixels).
The number of edgels in the right and left images
is about 1700. There are about 55 edges in each
image. The number attached to each edge indicates
its ordering index.

Fig. 11 indicates the matched segments
(edges). The number attached to each edge in
this figure still indicates its ordering index.
A few edges in Fig. 11 were not matched even
though visual inspection indicates that they
should have been. This can be mainly
attributed to the nonunique ordering of edges in
the two images. The omitted edges which have the
same maximum number of arcs are given
an arbitrary ordering index following a procedure
suggested in [10]. No incorrect matches were
observed. We essentially captured the shape of the
objects in both images except for two edges in
the left part of the images which were not
matched due to the arbitrary indexing of these
edges.

Fig. 9 Stanford University Blocks Image (left and right images)

Fig. 10 Edges in the Right and Left Images

Fig. 11 Matched Edges (Segments) (left and right images)

AERIAL VIEW OF WHITE HOUSE

The second image we took is a 256 x 256
portion of the 512 x 512, 8-bit intensity resolution
picture of the White House (Fig. 12). This is an
interesting image because it contains buildings
and highly textured trees. Fig. 13 is the result
from edge detection, linking and thresholding (on
the length of edges). The number of edgels in
the left and right images is about 3800. There
are about 140 edges in each image. As in the
previous example, the number attached represents
the ordering index. Fig. 14 indicates the
matched segments (edges). Although basic feature
elements have been successfully matched, the
percentage of matched segments is only 25% of total
segments. This small percentage is due to an
arbitrary ordering scheme. We consider the current
results a preliminary finding.

Fig. 12 Aerial View of White House

Fig. 13 Edges (left and right images)

Fig. 14 Matched Edges (left and right images)

CONCLUSION

We have described a stereo matching method
which provides optimum matching for edges
(intervals between edges) in the two stereo
images. The algorithm requires only 3UV
computations for edge matching and an additional 3TMN
computations for matching edgels in between edges.
The method works well for images which have long
edges. The method offers an efficient way for
stereo matching.

ACKNOWLEDGMENTS

The authors would like to extend their
gratitude to V.S. Nalwa for providing the
necessary data for the work. We thank David Kriegman
for his help in the display programming and Ross
D. Shachter for useful discussions.


REFERENCES

1. D. Gennery, Object detection and measurement
using stereo vision, in Proc. Image
Understanding Workshop, College Park, Md., Apr.
1980, 161-167.

2. H. Moravec, "Obstacle avoidance and
navigation in the real world by a seeing robot
rover," Stanford Artificial Intelligence
Laboratory AIM 340; Ph.D. thesis, Stanford
University, Sept. 1980.

3. R. Henderson, R. Miller, and C. Grosch,
Automatic stereo reconstruction of man-made
targets: Digital processing of aerial
images, Proc. Soc. Photo-Opt. Instrum. Engr.
186, No. 6, 1979, 240-248.

4. R. Kelly, P. McConnell, and S. Mildenberger,
The gestalt photomapping system, Photogramm.
Eng. Remote Sens. 43, No. 1407, 1977.

5. D. Panton, A flexible approach to digital
stereo mapping, Photogramm. Eng. Remote
Sens. 44, No. 12, 1978, 1499-1512.

6. S. Barnard and M. Fischler, Computational
stereo, ACM Computing Surveys 14, No. 4,
1982, 553-572.

7. V.S. Nalwa and T.O. Binford, On detecting
edges, IEEE Trans. Pattern Anal. Mach.
Intell. PAMI-8, Nov. 1986, 699-714.

8. R.D. Arnold, "Automated stereo perception,"
Stanford Artificial Intelligence Laboratory
Tech. report AIM-347, Stanford University,
1982.

9. H.H. Baker and T.O. Binford, Depth from edge
and intensity based stereo, in Proc. 7th
Int. Jt. Conf. A.I., Aug. 1981.

10. Y. Ohta and T. Kanade, Stereo by intra- and
inter-scanline search using dynamic
programming, IEEE Trans. PAMI, Mar. 1985.

11. M.J. Hannah, "Computer matching of areas in
stereo imagery," Ph.D. thesis, Stanford
University, 1974.

12. D.J. Panton, A flexible approach to digital
stereo mapping, Photogramm. Eng. Remote
Sens. 44, No. 12, 1978.

13. G.M. Medioni and R. Nevatia, Segment-based
stereo matching, Computer Vision, Graphics,
and Image Processing 31, 1985.

14. D. Marr and T. Poggio, "A theory of human
stereo vision," MIT AI memo No. 451, Nov.
1977.

15. D. Arnold and T. Binford, Geometric
constraints in stereo vision: Image processing
for missile guidance, Proc. Soc. Photo-Opt.
Instrum. Engr. 238, 1980, 281-292.

16. H.H. Baker, "Depth from edge and intensity
based stereo," Stanford Artificial
Intelligence Laboratory Tech. report AIM-347,
Stanford University, 1982.

17. W. Grimson, From Images to Surfaces, MIT
Press, Cambridge, Mass., 1981.

18. H.S. Lim and T.O. Binford, Stereo
correspondence: Features and constraints,
in Proc. Image Understanding Workshop, Dec.
1985, 373-380.


Steps Toward Accurate Stereo Correspondence¹


Steven  D.  Cochran 

Institute  for  Robotics  and  Intelligent  Systems 
School  of  Engineering 
University  of  Southern  California 
Los  Angeles,  California  90089-0273 


Abstract 

This is a preliminary report for a system which obtains an
accurate determination of 3-D surfaces from passive stereo.
We present an overview of the proposed system, which
encompasses aspects of Image Matching, Depth Determination,
and Interpolation. In Image Matching a major source
of error in the exact matching of edge elements is the loss
of edgels near corners. A new method is presented which
will fill these gaps in a reasonable manner. The approach
used first marks the junctions in the image with a labeling
which includes the orientation of the associated edge
gradients. Next, those edges which lie near the boundary of
blobs in the image are checked and where problem junctions
occur (violations of an “at least one-in and one-out” rule),
a search is made along the boundary of the associated blob
for the best candidate to complete the boundary. Initial
results for the gap-filling process are shown for real images.


1  Introduction 

Human Beings are able to perceive depth in 2-D “monocular”
images and in an enhanced form from the stereoscopic
combination of a pair of images. The process used by the
Human Visual System to do this, however, is not well
understood. Much research has been devoted to the automatic
abstraction of information about objects in images,
in order to produce autonomous systems which are able
to “perceive,” and operate upon their environments and,
in some cases, to gain insight about the operation of the
Human Visual System.

The major problem with current stereo correspondence
algorithms, such as [1,2,3,4,5,6,7,8,9], is that they have
difficulty dealing with added or missing features due
to noise or occlusion in real images. In addition, it is often
difficult to determine in detail the correct disparity of the
matched features in feature-based methods, since the same
abstraction that allows a feature to be represented also
introduces some ambiguity in how to provide a point-by-point
pairing of the feature in one image with the corresponding
“fine-structure” of the feature in a second image.

¹This research was supported by the Defense Advanced Research
Projects Agency under contract F33615-84-K-1404, monitored by the
Air Force Wright Aeronautical Laboratories, DARPA order number
33119, and by a TRW Fellowship.

The  recovery  of  the  3-D  characteristics  of  a  scene  from 
multiple  images  taken  from  different  points  of  view,  may  be 
viewed  as  consisting  of  the  following  steps  (from  Barnard 
and  Fischler  [10]) : 

1.  Image  Acquisition 

2.  Camera  Modeling 

3.  Feature  Acquisition 

4.  Image  Matching 

5.  Depth  Determination 

6.  Interpolation 

We  plan  to  solve  the  Depth  Determination  step  of  this 
process.  In  order  to  do  this  we  will  concentrate  on  the  last 
three  steps,  detecting  and  correcting  errors  in  the  Image 
Matching  step,  performing  a  partial  Depth  Determination, 
followed  by  an  estimation  of  the  object  surfaces  which  will 
allow  the  Interpolation  step  to  re-analyze  the  matching  to 
give  a  final,  accurate,  determination  of  the  depth. 

1.1  Initial  Results  in  Gap-Filling 

We  present,  in  this  paper,  our  initial  results  in  gap-filling  as 
an  initial  step  in  the  correction  of  the  Image  Matching.  In 
our  analysis  of  the  gap-filling,  we  concentrate  on  detecting 
the  gaps  formed  at  corners  since  they  are  a  particularly 
interesting  example  and  the  technique  simplifies  to  handle 
more  general  edges. 

Corner detection is an important research area in
computer vision and image processing. When complete curves
are present, the task is the determination of which point or
points form the corner and finding its radius of curvature.
Various corner detectors of this sort have been developed
and are reported in [11,12,13].


A problem occurs when the curve is fragmented due
either to the assumptions made by the underlying edge
detection or to a low edge contrast. The former may be
improved by better edge detection algorithms, but the latter
remains as a source of gaps in the curves, which are often
accentuated at the corners due to the interaction of two (or
more) edges. Processing the intensity images in Figure 1
yields several examples of such lost corners; these are shown
in Figure 2. When a complete curve is not available, some
method must be introduced to fill in the missing points.
One example of this approach is [14], in which Nevatia
and Huertas use knowledge of the scene domain to
predict the expected corner and then use shadows to confirm
their hypothesis. Their hypotheses assume approximately
orthogonal L- or T-junctions. While this is useful for some
domains like building detection in aerial images, it does
not take into account non-parallelepiped structures such as
those in Figure 1 (a). We propose to combine the
information extracted from the original intensity images via both
feature- and area-based approaches.

2  System  Description 

We  have  decided  to  use  a  feature-based  approach  since  our 
desire  is  for  an  accurate  system.  The  feature-based  meth¬ 
ods  give  us  the  potential  for  sub-pixel  accuracy  at  the  corre¬ 
sponding  points  which  in  turn  gives  us  a  better  triangulated 
range.  The  major  problems  with  these  methods  are: 

1.  Missing  and  Extra  features, 

2.  Mismatched  features, 

3.  Unmatched  features,  and 

4.  Sparsity  of  data. 

Accuracy  in  a  two  camera  system  is  best  along  edges 
perpendicular  to  a  plane  through  the  baseline  between  the 
cameras.  This  is  because  the  epipolar  geometry  provides 
an  additional  source  of  information  which  is  lacking  when 
the  corresponding  points  are  contained  in  such  an  epipolar 
plane.  Thus  not  all  point  correspondences  are  equally  good. 

We  plan  to  augment  a  feature-based  method  developed 
by  Medioni  and  Nevatia  [4]  to  detect  and  correct  all  of  these 
problems.  Mismatches  will  be  identified  and  corrected  by 
extending  the  work  of  Mohan  [6]  from  edgels  to  contours. 
Missing  edges  and  gaps  will  be  filled  by  a  modified  form  of 
model  matching  after  Falk  [15],  Shirai  [16,17]  and  Shapira 
and  Freeman  [18],  along  with  a  newly  developed  join  la¬ 
beling  scheme  and  the  gap-filling  (both  described  in  this 
paper).  Surfaces  will  be  located  by  analysis  of  contour  and 
surface  marking  edges  to  define  a  dense  matching. 

We  present  below,  an  overview  of  our  proposed  system 
for  the  detection  and  correction  of  stereo-correspondence 
errors  that: 


1.  Integrates  two-  and  three-dimensional  continuity. 

2.  Hypothesizes  corrections  to  the  observed  image  fea¬ 
tures. 

3.  Performs  exact  matching  of  the  features  at  the  pixel 
(or  subpixel)  level  to  determine  disparity  where  possi¬ 
ble  and  marks  those  places  where  disparity  cannot  be 

determined  without  inferring  the  underlying  geome¬ 
try  of  the  object  or  objects  making  up  the  scene. 

4.  Hypothesizes  faces  or  surfaces  in  three  dimensions 
consistent  with  the  observed  and  matched  features. 

5.  Propagates  disparity  across  those  features  whose  dis¬ 
parity  could  not  be  found  in  the  third  step  above, 
guided  by  the  hypothesized  surfaces. 

We  use  the  hypothesize-and-test  paradigm  in  three  dif¬ 
ferent  areas  to  detect  and  correct  errors  in  matching,  to 
locate  missing  features,  and  to  build  object  surfaces.  When 
possible,  each  hypothesis  is  verified  and  the  matching  is  cor¬ 
rected  by  using  the  connectivity  and  geometry  of  the  3-D 
scene  being  constructed.  High  confidence  matches  are  ini¬ 
tially  accepted  as  valid  hypotheses  and  disparities  assigned. 
Where  a  hypothesis  cannot  be  verified,  it  is  carried  forward 
in  the  analysis.  In  the  event  that  a  hypothesis  cannot  be 
verified  within  the  scope  of  our  analysis  (for  instance,  some 
perceived  edges  may  be  missed)  it  becomes  part  of  the  anal¬ 
ysis  and  serves  the  purpose  of  alerting  higher-level  processes 
of  an  unresolved  problem. 

We  assert  that  typical  errors  in  passive  feature-based 
stereo-correspondence,  such  as  those  listed  at  the  beginning 
of  this  section,  may  be  corrected  by  the  reprocessing  of  the 
matched  features  in  two  or  more  images.  Also,  that  in 
order  to  assign  disparity,  the  matching  between  features 
must  be  extended  to  its  component  parts  which  cannot  be 
completely  accomplished  without  first  inferring  the  object 
surface  that  generated  these  features. 

The initial stereo matching allows us to investigate the
monocular continuity of lines in one image relative to the
other. Inconsistencies indicate errors in either the feature
extraction or in the matching. Such errors are usually
caused by noise or occlusion, but sometimes are due to
a missing line whose existence must be hypothesized from
the image data and knowledge about the world. (These
latter sources of error are called perceived edges.) During
this phase we seek to directly correct as many errors as
possible and attempt to detect the rest, although some
errors may not be detectable at this stage. The detected
inconsistencies which cannot be corrected generate hypotheses
which are used as markers to guide the application of
information gathered during the subsequent processing.

The ultimate goal of this system will be to provide an
accurate and robust stereo matcher that correctly matches
a set of features in an image pair. In addition we will be
able to provide a dense disparity map for the scene which is
consistent with the observed features and that is produced
using passive imaging techniques. A list of unresolved
hypothesis/problem entries will serve to alert higher-level
processes of problems with the interpretation which may need
the application of more scene specific knowledge such as
assumptions of object type to be resolved. In addition,
feature abstraction, labeling and matching information will be
available so that it will be possible to feed back
information to the system or for it to serve as additional input to
the high-level processing. We view the proposed system as
being an important link between existing low and medium
level feature extraction systems and higher level systems
such as ACRONYM [19] or as an additional channel for
systems like MOSAIC [20].

2.1  Choice  of  Primitives 

As  part  of  our  overall  strategy  we  will  be  using  the  following 
features  to  extract  accurate  stereo  correspondences  between 
points  in  a  pair  of  passively  acquired  intensity  images.  The 
purpose  of  this  paper  is  to  show  how  the  missing  corners 
may  be  found.  We  are  especially  interested  in  junctions 
into  which  there  are  no  edges  entering  or  from  which  there 
are  no  edges  exiting.  The  most  common  such  junction  is 
the  E-junction  which  represents  the  endpoint  of  an  isolated 
curve  or  segment. 

The  problem  that  we  wish  to  solve  is  that  of  finding 
an  accurate  representation  for  the  visible  surfaces  of  a 
set  of  objects  in  a  scene,  by  the  analysis  of  a  pair  of  stereo 
intensity  images  of  that  scene  taken  by  two  cameras  a  small 
distance  apart.  Therefore  we  use  feature-based  methods, 
since  they  provide  better  accuracy  by  matching  features 
which  may  be  found  to  sub-pixel  precision.  The  feature 
primitives  that  we  will  use  are  edgels,  segments,  junctions, 
regions,  and  blobs.  We  choose  these  features  because  they 
are  indicative  of  intensity  discontinuities  which,  typically, 
occur  only  at  object  boundaries,  at  surface  discontinuities 
and across surface markings (e.g., in textured regions).

EDGELS  are  the  fundamental  feature  that  we  use. 
They  represent  the  location  and  orientation  of  a  zero-cros¬ 
sing  of  the  second  derivative  of  the  intensity  image,  located 
to  pixel  or  sub-pixel  accuracy.  Their  orientation  is  perpen¬ 
dicular  to  the  direction  of  the  maximum  change  in  intensity 
such  that  the  lighter  side  is  to  the  right.  They  will  serve  as 
the source of our most accurate information about the position
of interesting points. We divide the edgels by two classifications
into four groups: the stronger and the weaker,
based  on  (1)  their  strongest  response  to  an  edge  mask  and 
(2)  their  alignment  with  respect  to  nearby  strong  edgels; 
and  vertical  — vs —  horizontal,  based  on  their  alignment 
with  respect  to  the  epipolar  rays. 

CURVES  represent  connected  sequences  of  edgels. 
They  provide  an  ordered  progression  through  the  image  and 
may be used to propagate (both monocular and stereo) relationships
through the image, subject to the restriction that
they may (1) have some non-invariant shifts, e.g. specular
edges  shift  with  even  small  changes  in  viewpoint;  (2)  wan¬ 
der  from  one  object  to  another  which  is  adjacent  in  2-D  but 
not  necessarily  in  3-D. 

LINEAR  SEGMENTS  are  a  further  abstraction  of 
the  edgels  and  represent  straight  or  nearly  straight  sequences 
of  them.  These  are  the  most  useful  features  for  stereo 
matching  because  they  are  simple  and  closely  match  the  im¬ 
portant  edges  in  the  scene.  However,  their  accuracy  is  not 
very  good  and  they  should  not  be  used  for  depth  determi¬ 
nation.  Segments  may  be  labeled  as  being  on  the  boundary 
or  in  the  interior  of  a  blob  (see  below). 

JUNCTIONS  (or  JOINS)  are  the  edgels  at  the  inter¬ 
section  of  curves  and/or  linear  segments.  They  are  used 
to  indicate  points  at  which  matching  may  be  propagated 
due  to  continuity.  In  addition,  some  join  types  are  very  un¬ 
likely.  These  unlikely  joins  will  be  used  to  suggest  possible 
problem  areas  such  as  those  places  where  segments  were 
missed.  We  use  a  labeling  scheme  in  which  we  retain  a  list 
of  properties  at  each  junction.  These  properties  include: 

•  The  total  number  of  intensity  edges  entering  or  leav¬ 
ing  the  junction  (the  minimum  is  1,  the  maximum 
used  is  3). 

•  The  number  of  intensity  edges  entering  the  junction, 
and  hence  (along  with  the  total  edges  count)  the  num¬ 
ber  leaving. 

•  The  junction  type  —  in  the  traditional  L,  T,  Y, 
W  sense,  to  which  we  add  S  for  nearly  straight  se¬ 
quences,  E  for  lone  endpoints,  and  X  for  junctions 
composed  of  more  than  three  edges. 

•  The  arrangement  of  the  junction  in  terms  of  the  gen¬ 
eral  angle  between  the  edges.  We  use  the  values: 

A  Acute 

L  Square  (approximately) 

O  Obtuse 

S  Straight  (approximately) 

REGIONS  are  another  way  of  viewing  the  intensity 
discontinuities.  We  will  use  them  to  help  with  gap  fill¬ 
ing  and  with  grouping  together  features  which  form  the 
members  of  object  faces.  We  locate  regions  by  recursively 
splitting  an  image  using  a  threshold  defined  by  histogram 
after  [21].  Associated  with  the  regions  are  simple  attributes 
such  as  area,  perimeter,  average-intensity,  and  orientation 
as well as higher-order attributes such as perimeter²/area
and the standard deviation of intensity.

BLOBS  are  a  subset  of  the  regions  which  meet  the  fol¬ 
lowing  set  of  criteria.  These  criteria  are  designed  to  admit 
only  the  semantically  meaningful  regions  while  excluding 
the  rest. 




1. The region must have no descendants; that is, it
must be a leaf node in the hierarchical tree of regions.

2.  Its  area  must  be  greater  than  some  minimum. 

3.  The  region  must  be  compact. 

However, we desire to accept and work with real-world
scenes, and therefore we do not assume that our data is perfect.
Rather, we expect missing and extra edges and junctions
due to noise and to conditions that we assume do not
exist in our analysis but which do occasionally occur and
produce errors during the low-level and preprocessing steps.
We  feel  that  a  realistic  system  must  be  prepared  to  detect 
and  correct  such  errors  while  at  the  same  time  choosing 
heuristics  which  generally  serve  to  quickly  and  accurately 
determine  the  components  of  the  scene. 

During the 2-D processing we use both images of a
stereo pair: at first, one is considered to be our
monocular image while the second acts as a model, and then
they swap roles. An initial mapping between the image and
the  model  is  provided  by  a  stereo  matcher  [4]  that  provides 
a  list  of  likely  matches  between  the  segments  in  the  two 
images. 

2.1.1  Join  Labeling 

The  following  are  the  rules  that  our  system  uses  to  label 
joins.  The  type  of  join  is  determined  by  the  number  and 
orientation  of  the  lines  which  form  it,  while  the  subscript 
in  our  nomenclature  gives  the  number  of  those  (intensity) 
oriented  segments  entering  the  join.  The  superscript  in¬ 
dicates  the  specific  join  from  the  variations  possible  under 
our  “ALOS”  model.  Each  such  variation  is  unique  in  terms 
of  rotation  and  reflection  from  any  other  variation.  The 
variations  show  the  combinations  of  acute  (A),  right  (L), 
obtuse  (O)  and  straight  (S)  angles  combined  with  the  rel¬ 
ative  line  orientations.  This  alternate  labeling  scheme  will 
be used to provide an additional measure of similarity between
junctions, in which the angles between the segments
change between the two viewpoints. This will be described
in  more  detail  in  Section  2.2.2.  The  angles  are  termed  right 
if  they  are  within  L-Threshold  of  90°,  and  straight  if  they 
are  within  S-Threshold  of  180°. 
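
The angle labels of the “ALOS” model can be sketched in a few
lines. This is only an illustration: the function name and the
default threshold values of 10° are assumptions, since the text
does not give values for L-Threshold or S-Threshold.

```python
def alos_label(angle_deg, l_threshold=10.0, s_threshold=10.0):
    """Label the angle between two segments as A, L, O or S.

    The threshold defaults are placeholders, not values from the paper.
    """
    a = abs(angle_deg) % 360.0
    if a > 180.0:
        a = 360.0 - a              # fold the angle into [0, 180]
    if abs(a - 180.0) <= s_threshold:
        return "S"                 # straight (approximately)
    if abs(a - 90.0) <= l_threshold:
        return "L"                 # right / square (approximately)
    return "A" if a < 90.0 else "O"   # acute or obtuse
```

For example, with these placeholder thresholds an angle of 45° is
labeled A, 92° is labeled L, 130° is labeled O and 175° is labeled S.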

1.  ‘E’-junctions  represent  the  endpoints  of  segments  that 
do  not  form  a  join. 

2.  Joins  of  exactly  2  segments  are  labeled  as  one  of  the 
following  two  types: 

(a)  L-junctions  are  those  joins  between  exactly  two 
segments  which  form  an  angle  which  is  not  con¬ 
sidered  straight. 


(b)  The  remaining  joins  are  considered  present  only 
for  the  linearization  of  curves  to  simplify  anal¬ 
ysis  (e.g.  knot  points).  For  labeling  purposes, 
the  joins  are  marked  as  S-junctions  indicating 
that they are, in some sense, straighter than L-junctions.

3.  Joins  of  exactly  3  segments  are  labeled  as  one  of  the 
following  types: 

(a)  T-junctions  if  one  pair  of  segments  form  a 
straight  angle. 

(b) Y-junction if all clockwise angles between pairs
are less than 180° − S-Threshold.

(c) W-junction if one clockwise angle between pairs
is greater than 180° − S-Threshold.

4.  Joins  of  higher  order  (i.e.,  composed  of  more  seg¬ 
ments)  than  3  are  labeled  as  multi-  or  X-junctions  and 
will  be  further  expanded  only  if  junctions  of  this  type 
occur  often  enough  to  make  this  extension  worth¬ 
while. 
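
The labeling rules above can be sketched as a small classifier.
This is a sketch under stated assumptions, not the system's
implementation: each segment is represented only by the direction
(in degrees) in which it leaves the join point, the function name
and the 10° default for S-Threshold are hypothetical, and the T
test is applied before the W test so that the two rules do not
overlap.

```python
def classify_join(directions_deg, s_threshold=10.0):
    """Return E, L, S, T, Y, W or X for the segments meeting at a join.

    directions_deg: one direction per segment, measured outward from
    the join point. s_threshold is a placeholder value.
    """
    n = len(directions_deg)
    if n == 0:
        raise ValueError("a join needs at least one segment")
    if n == 1:
        return "E"                       # lone endpoint
    if n > 3:
        return "X"                       # multi-junction
    dirs = sorted(d % 360.0 for d in directions_deg)
    # Angular gaps between neighbouring segments, going once around.
    gaps = [(dirs[(i + 1) % n] - dirs[i]) % 360.0 for i in range(n)]
    if n == 2:
        straight = 180.0 - min(gaps) <= s_threshold
        return "S" if straight else "L"
    # n == 3
    if any(abs(g - 180.0) <= s_threshold for g in gaps):
        return "T"                       # one pair forms a straight angle
    if max(gaps) > 180.0:
        return "W"                       # one wide angle: arrow shape
    return "Y"                           # all angles below straight
```

With the placeholder threshold, `classify_join([0, 180, 270])`
gives T, `classify_join([0, 120, 240])` gives Y and
`classify_join([0, 40, 80])` gives W.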

Junction Catalog. Figure 3 shows all possible first-, second-
and third-order joins. Some of these are very unlikely
because they require a smooth, low spatial frequency
change in intensity locally about the join, in violation of the
generality assumption. Those which are considered likely are
labeled as “complete.” The unlikely cases can be used to
support  the  hypothesis  of,  or  a  search  for,  a  more  likely 
(complete)  join  type.  For  instance,  if  a  join  of  type  L%  were 
found  then  it  would  support  the  hypothesis  that  one  of  Yf  ', 
Wjbcd‘,  or  perhaps  one  of  r2°'  may  actually  be  present.  We 
submit  that  a  corollary  of  the  generality  constraint  is  that 
for  a  join  formed  by  the  meeting  of  lines  corresponding  to 
intensity  gradients  in  the  image,  at  least  one  segment  must 
enter  the  join  and  one  segment  must  exit  the  join. 
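
The property list kept at each junction, together with the
corollary just stated, can be sketched as a small record type.
The class and method names are hypothetical, and treating a
junction as “complete” exactly when it has at least one entering
and one exiting edge is a simplifying assumption made for this
illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Junction:
    kind: str          # "E", "L", "S", "T", "Y", "W" or "X"
    total_edges: int   # intensity edges entering or leaving
    entering: int      # intensity edges entering the junction

    def __post_init__(self):
        if self.total_edges < 1:
            raise ValueError("a junction needs at least one edge")
        if not 0 <= self.entering <= self.total_edges:
            raise ValueError("entering must be in 0..total_edges")

    @property
    def leaving(self) -> int:
        # Follows from the total count and the entering count.
        return self.total_edges - self.entering

    def is_complete(self) -> bool:
        # Corollary: at least one edge enters and one edge exits.
        return self.entering >= 1 and self.leaving >= 1

    def missing_edge(self) -> Optional[str]:
        # Which kind of edge a hypothesis should search for, if any.
        if self.entering == 0:
            return "entering"
        if self.leaving == 0:
            return "leaving"
        return None
```

An endpoint with one entering edge, `Junction("E", 1, 1)`, is
reported as incomplete with a missing leaving edge, which is the
kind of case that later generates a hypothesis.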

2.2  2-D  Processing 

During  the  2-D  processing  phase  of  our  system  we  integrate 
and  organize  the  information  supplied  by  the  preprocessing 
routines  above.  Segment  match  information,  along  with 
the  continuity  between  adjacent  segments  and  the  proxim¬ 
ity  of  other  features,  are  used  to  detect  and  correct  some 
of  the  matching  errors  produced  during  the  preprocessing. 
In  some  cases  a  problem  may  be  detected  but  the  correc¬ 
tion  cannot  be  determined.  We  retain  this  information  by 
adding  one  or  more  problem/hypothesis  entries  to  the  list 
of  unresolved  hypotheses. 

2.2.1  Continuity  Analysis 

The set of connected joins may be used to propagate
matches (indicated by a “⇔”) through the image. If we
assume that the disparity changes smoothly over each line,
then if there are a few incorrectly matched points along
the line, they may be corrected by setting them to values
interpolated along the line [6]. This process may then be
extended by allowing the values to propagate along connected
(indicated by a “~”) lines whether or not they have been
matched, except that no information may be propagated
through  the  “stem”  portion  of  the  ‘T’-junctions.  Also,  the 
goodness  of  the  match  varies  with  the  orientation  of  the 
lines  relative  to  the  local  epipolar-plane(s). 
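
The interpolation step can be illustrated as follows. This is a
minimal sketch: it assumes the disparities are sampled at evenly
spaced points along one line and that the suspect points have
already been flagged by some other test, which is not shown. The
function name is hypothetical.

```python
def interpolate_suspects(disparities, suspect):
    """Replace flagged disparities by values interpolated along the line.

    disparities: disparity at evenly spaced points along one line.
    suspect: parallel list of booleans, True where the match is doubted.
    """
    if len(disparities) != len(suspect):
        raise ValueError("inputs must have the same length")
    good = [i for i, bad in enumerate(suspect) if not bad]
    if not good:
        raise ValueError("no trusted points to interpolate from")
    out = list(disparities)
    for i, bad in enumerate(suspect):
        if not bad:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            out[i] = disparities[right]      # before first trusted point
        elif right is None:
            out[i] = disparities[left]       # after last trusted point
        else:
            t = (i - left) / (right - left)
            out[i] = (1 - t) * disparities[left] + t * disparities[right]
    return out
```

For the disparities 10, 11, 30, 13, 14 with only the third point
flagged, the flagged value is replaced by 12.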

For instance, if we have two matched curves C_l ⇔ C_r,
as shown in Figure 4, we may hypothesize that A_l ⇔ A_r,
B_l ⇔ B_r, D_l ⇔ D_r, and E_l ⇔ E_r.

Incorrect matches may also be detected by continuity.
Figure 5 shows a case of a bad match B_l ⇔ C_r, where
A_l ⇔ A_r are matched and B_r is unmatched. Here, the
connectivity of A_l ~ B_l and A_r ~ B_r, along with the matching
above, implies that the B_l ⇔ C_r match should be broken
and replaced by a hypothesized B_l ⇔ B_r match. In the
case where connectivity alone does not allow a choice of
which matches to prefer, the “goodness” of the match
may be used to resolve the tie.

2.2.2  Hypothesis  Formation 

The rules for hypothesis formation based on join labeling
are:

1.  There  is  always  at  least  one  line  entering  and  exiting 
a  junction.  This  comes  from  the  assumption  that  no 
chance  alignment  occurs,  and  allows  us  to  hypothesize 
in  many  cases  the  existence  and  approximate  orienta¬ 
tion  of  additional  edges  that  were  missed  during  the 
feature  extraction.  Often  these  hypotheses  will  later 
serve  as  the  guides  for  the  Gap-Filling  process. 

2. Matched junctions have the same number of incoming
and outgoing edges. This rule is weaker, but we
are assuming acute-angle stereo, in which the baseline
is much shorter than the range to the scene. In this
scenario, the vertices generate similar junctions in the
two image planes, subject to the small changes that can
occur. Figure 6 shows a graph of transitions between
junction types.

This  allows  us  to  hypothesize  the  most  likely  match¬ 
ing  junction  as  well  as  less  acceptable  alternatives 
should  other  evidence  support  an  alternative  match. 
These  hypotheses  when  combined  with  the  continu¬ 
ity  and  the  partial  match  data  will  allow  us  to  correct 
most  of  the  errors  which  occur  during  the  matching 
phase.  Table  1  gives  the  distance  between  the  junc¬ 
tion  types  shown  in  Figure  6,  assuming  a  unit  cost 
for  each  transition.  A  cost  of  0-2  is  reasonable,  but 
higher  costs  imply  an  unlikely  match. 
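
With a unit cost per transition, the distance between two
junction types is the length of the shortest path between them in
the transition graph, which a breadth-first search computes. The
sketch below only illustrates that computation: the edge list
`PLACEHOLDER_EDGES` is invented for the example and is not the
graph of Figure 6, and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Invented edges for illustration only; NOT the graph of Figure 6.
PLACEHOLDER_EDGES = [("E", "L"), ("L", "S"), ("L", "Y"),
                     ("Y", "W"), ("Y", "T"), ("S", "T")]


def build_graph(edges):
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)      # transitions are treated as symmetric
        graph[b].add(a)
    return graph


def transition_cost(graph, start, goal):
    """Fewest unit-cost transitions from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, cost = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return cost + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
    return None              # goal cannot be reached


def is_reasonable_match(graph, a, b, max_cost=2):
    # A cost of 0-2 is described as reasonable in the text.
    cost = transition_cost(graph, a, b)
    return cost is not None and cost <= max_cost
```

Substituting the real transitions of Figure 6 for the placeholder
edge list would reproduce the kind of distances tabulated in
Table 1.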


For  each  incomplete  junction,  the  component  lines  are 
extended  along  their  orientation,  and  along  any  associated 
region,  for  a  set  length  and  a  search  is  made  within  an  asso¬ 
ciated  sweep  angle.  Both  of  these  parameters  are  calculated 
locally  based  on  the  length  and  orientation  of  the  individual 
segments  and  the  strength  of  the  region  boundaries. 

2.3  Gap  Filling 

From  the  passively  acquired  intensity  images  we  extract  the 
edge  points  and  organize  them  into  linear  segments.  At  the 
same  time  we  extract  the  hierarchical  tree  of  regions. 

In  the  actual  system  the  linear  segments  in  a  pair  of 
stereo  images  are  matched  and  the  match  data  is  checked 
by  using  the  continuity  in  one  or  both  images.  When  cor¬ 
ners  are  lost,  it  is  often  the  case  that  the  corner  is  lost  in 
both  of  the  images,  so  we  will  consider  only  the  monocular 
application  at  this  time,  since  continuity  will  serve  to  fill 
the  gap  when  only  one  of  the  stereo  images  contains  a  gap. 

For  each  likely  endpoint  we  will  look  for  the  nearest  blob 
boundary  point  and  follow  the  contour  of  that  region  until 
we  locate  an  acceptable  continuation  segment/curve.  Then 
we  must  determine  the  exact  path  of  the  corner. 

2.3.1  Initial  Processing 

We  first  apply  a  feature-  and  an  area-based  operator  to 
the  intensity  image.  The  feature-based  operator  extracts 
the  edges  from  the  image,  thins  and  links  them  using  the 
technique developed by Nevatia and Babu [22]. This process
yields both the edgels linked together to form curves,
and  the  linear  segments  abstracted  from  these  curves.  The 
second operator extracts a hierarchical tree of regions using
the Ohlander-Price-Reddy algorithm [21]. Since we are
interested in regions that are as compact and convex as
possible, we further process the regions, filtering out any
regions with an area smaller than 100 or with a value of
perimeter²/area > 30 to obtain the desired regions, which
we now term “blobs.”
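
The blob criteria can be sketched as a filter over the region
tree. The `Region` type and function names are hypothetical; the
thresholds (minimum area 100, perimeter²/area at most 30) are the
ones quoted above, and the leaf-node test comes from the blob
definition in Section 2.1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    name: str
    area: float
    perimeter: float
    children: List["Region"] = field(default_factory=list)


def is_blob(region, min_area=100.0, max_compactness=30.0):
    """Leaf region that is large enough and compact enough."""
    if region.children:                 # must be a leaf of the tree
        return False
    if region.area < min_area:          # too small
        return False
    return region.perimeter ** 2 / region.area <= max_compactness


def collect_blobs(root, min_area=100.0, max_compactness=30.0):
    """Walk the region tree and return every region that is a blob."""
    blobs, stack = [], [root]
    while stack:
        region = stack.pop()
        if is_blob(region, min_area, max_compactness):
            blobs.append(region)
        stack.extend(region.children)
    return blobs
```

A 20 by 20 square (area 400, perimeter 80) has perimeter²/area of
16 and is kept, while a 2 by 200 strip of the same area has a
value of about 408 and is rejected.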

2.3.2  Associating  Segments  with  Blobs 

Next, each segment is potentially associated with a blob by
searching a square window of half-width w about each of
the segment's component edgels (see Figure 7).

A count of the edgels within the search area is maintained
by blob-id and by whether the edgel is on the boundary
or in the interior of the blob. The points on the boundary
are weighted with a value of 2.5w, which is the relative
importance of a boundary point to an interior point in the
window. The total weighted points are summed and the
segment is assigned a classification by the percentage of the
points in each category. The following is a typical
classification which indicates that it is very likely that the
segment is along the boundary of blob 14 but should also be
considered when working with blob 11.




(:CLASSIFICATION
  (:BOUNDARY 14 0.46)
  (:BOUNDARY 11 0.29)
  (:INTERIOR 14 0.16)
  (:INTERIOR 11 0.10))
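
The weighting and normalization that produce such a
classification can be sketched as follows. The function name and
the input layout (a dictionary of raw point counts keyed by
category and blob-id) are assumptions made for the example;
interior points are assumed to carry a weight of 1.

```python
def classify_segment(counts, w):
    """Turn raw point counts into weighted fractions per category.

    counts: {("BOUNDARY" | "INTERIOR", blob_id): number of points}
    w: half-width of the search window.
    """
    boundary_weight = 2.5 * w            # weight quoted in the text
    weighted = {}
    for (kind, blob_id), n in counts.items():
        weight = boundary_weight if kind == "BOUNDARY" else 1.0
        weighted[(kind, blob_id)] = n * weight
    total = sum(weighted.values())
    if total == 0:
        return []                        # nothing found in the window
    entries = [(kind, blob_id, value / total)
               for (kind, blob_id), value in weighted.items()]
    # Most likely category first.
    return sorted(entries, key=lambda entry: -entry[2])
```

With w = 2, six boundary points and ten interior points of the
same blob give fractions of 0.75 and 0.25 respectively.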

2.3.3  Locating  the  Endpoints  of  a  Gap 

A  search  is  made  near  the  endpoint  of  a  selected  segment 
and  a  boundary  point  is  selected.  The  boundary  is  followed 
in  the  direction  indicated  by  the  type  of  junction  and  the 
gradient across the region boundary. An endpoint of type
E_1 (where the subscript indicates the number of incoming
edges) indicates that the search should check for an outgoing
edge while following the boundary. For instance, for a
region  with  a  light  interior  the  search  would  be  clockwise 
around  the  region.  Figure  8  shows  the  search  region  along 
the  boundary  of  some  example  regions. 

As with the classification of segments above, we use a
window of half-width w and search for a compatible segment
in the border-set of the region. For the E_1 junction
example above, we would desire an E_0, an S (to form a
T-junction), or any other junction which needs or will accept
an incoming intensity edge.

2.3.4  Determination  of  the  Corner 

Now that the endpoints of the missing corner have been
located, as well as the associated section of the blob boundary,
we must determine the specific path of the corner. For
monocular images this may be done by:

1.  Extending  the  chosen  curves  until  they  intersect,  for 
a  pair  of  curves  that  are  approximately  orthogonal, 
or 

2. Following the smoothed contour of the blob at the
distance (1 − t) d_1 + t d_2 from the boundary, where t
varies from 0 to 1 along the gap, and d_i is the distance
of endpoint i from the blob.
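
Both options can be sketched briefly. The function names are
hypothetical, points and directions are plain (x, y) tuples, and
the smoothing of the blob contour itself is not shown.

```python
def intersect_lines(p, u, q, v, eps=1e-9):
    """Intersection of the lines p + s*u and q + r*v (option 1).

    Returns None when the two lines are (nearly) parallel.
    """
    denom = u[0] * v[1] - u[1] * v[0]
    if abs(denom) < eps:
        return None
    s = ((q[0] - p[0]) * v[1] - (q[1] - p[1]) * v[0]) / denom
    return (p[0] + s * u[0], p[1] + s * u[1])


def blended_offset(t, d1, d2):
    """Offset from the blob boundary at position t in the gap (option 2).

    d1 and d2 are the distances of the two gap endpoints from the blob.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    return (1.0 - t) * d1 + t * d2
```

Extending a horizontal curve from (0, 0) and a vertical curve
from (5, -3) places the corner at (5, 0); halfway along a gap
whose endpoints lie 2 and 4 units from the blob, the offset is 3.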

For  stereo  images,  there  is  an  additional  constraint  of 
desiring  a  smooth  transition  on  the  assumption  that  the 
gap  being  filled  represents  a  continuous  contour  of  an  object 
face  [23,6]. 

2.3.5  Results 

Figure 9 shows some of the strong segments in each of the
images. The segments marked as being very likely on the
border of the center blob are shown with darker edgels (the
circles mark the direction of the edge in the manner of an
arrowhead). Figure 10 shows the application of this process
to a gap in each of these three different domains. Note
that in Figure 10 (a) the search rejected a possible segment
because it could not form a reasonable junction, and in
(c) the vertical segment bordering the nearby region was
rejected in favor of a better candidate above, because
the nearby segment was more likely part of another blob.

In  combination  with  continuity  checking  [24,3],  filling 
the  gaps  by  combining  information  obtained  from  multiple 
sources  can  help  to  improve  the  matching  of  contour  edges. 
This, in turn, will allow the detection and correction of
errors  in  stereo  matching.  The  improvement  in  exact  edgel 
matching  is  an  important  part  of  accurately  determining 
the  depth  of  points  in  3-D. 

2.4  3-D  Processing 

During  this  phase  of  the  processing  we  plan  to  use  the 
feature  primitives  along  with  their  2-D  matching  and  the 
camera  model  to  generate  a  set  of  points  in  3-D.  The  first 
sub-task  is  to  generate  the  detailed  matching  (at  the  pixel 
or  sub-pixel  level)  of  the  matched  features.  As  discussed 
below,  we  will  not  be  able  to  generate  all  of  the  matches 
initially.  Next  we  will  cluster  the  points  representing  fea¬ 
tures  that  seem  to  be  associated  with  individual  object  faces 
and  once  surfaces  have  been  determined,  they  will  be  used 
to find the remaining edgel matches. In addition, refinements
will be made at this stage to correct for distortion caused
by vertically oriented limb edges.

2.4.1  Exact  Edgel  Matching 

Horizontal  disparities  carry  information  about  the  local 
depth  variations,  while  vertical  disparities  (because  they 
are  insensitive  to  depth)  can  be  used  to  extract  the  viewing 
system  parameters  of  angle  of  gaze  and  viewing  distance 
without  requiring  stronger  assumptions  about  the  nature 
of  the  system  from  which  they  originate. 

Currently edges are extracted at approximately pixel
precision; that is, each edge point is accurately positioned
to within ½ pixel of its actual location. The line segments
that represent approximately linear runs of edgels vary from
the assigned edgel locations by at most a specified maximum.
We value long, reasonably straight runs of edgels as
a useful feature for stereo matching, but the errors associated
with these features are too great for the derivation of
accurate 3-D data. To get the accuracy that we need, we
first note that when we match segments we are only using
the segments as a useful approximation to a linear run of
edgels, so it is the edgels that we want to match. However,
the edgels have an error also. Research is being done at
USC by Huertas, Medioni, Chen, and Ulupinar [25,26] to
obtain edgels to sub-pixel precision and we plan to use this
technology as it becomes available. In the meantime we
will simulate this by fitting a B-spline curve locally to the
edgels to guarantee that we have a C¹ smooth curve in 3-D
when we match the curves.

However  we  must  also  consider  those  cases  where  we 
cannot  assign  a  disparity  at  all  because  the  edge  is  horizon¬ 
tal  with  respect  to  the  epipolar  plane.  Such  edges  cannot  be 
assigned a disparity because stereo correspondence is based
on the assumption that we have a single point correspondence
in the epipolar-plane. But in the special case where
the  edge  is  in  the  epipolar-plane  we  will  have  many  such 
points.  There  are  only  two  ways  that  we  can  assign  dispar¬ 
ity.  The  first  is  that  we  assume  a  linear  fit  and  we  match 
points  in  the  ratio  of  their  lengths.  For  some  scene  domains 
[2,27],  we  may  do  this  but  there  still  remains  the  problem 
of  missing  edges  or  extra  ones  due  to  noise.  If  we  do  not 
know  the  endpoints  then  we  may  not  match  the  points  at 
all. 
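
The linear-fit option can be sketched as below. This is an
illustration only: it assumes the two matched horizontal segments
lie on the same epipolar line and are described by the x
coordinates of their endpoints, and the function name is
hypothetical.

```python
def match_by_length_ratio(x_left, left_seg, right_seg):
    """Map a point on the left segment to the right segment.

    left_seg, right_seg: (start_x, end_x) of the matched segments.
    The point keeps its fractional position along the segment.
    """
    l0, l1 = left_seg
    r0, r1 = right_seg
    if l1 == l0:
        raise ValueError("left segment has zero length")
    t = (x_left - l0) / (l1 - l0)        # fraction along the left segment
    return r0 + t * (r1 - r0)            # same fraction on the right
```

For a left segment spanning x = 10 to 30 and a right segment
spanning x = 4 to 20, the left midpoint x = 20 maps to x = 12 on
the right. The sketch depends on both pairs of endpoints being
known, which is the limitation noted above.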

In  scenes  which  contain  curved  objects,  this  method  will 
not  work  because  the  matching  relationship  between  the 
points on the two segments is nonlinear. In fact, when a
curve is terminated by a limb (self-occluding) contour, such
as the top or bottom of an upright cylinder, we do not know
the  correct  end  points.  For  a  method  to  work  with  all  such 
scenes  then,  we  need  a  more  general  approach. 

2.4.2  Grouping  Segments  in  3-D 

The approach that we propose is to hypothesize a surface
which:

1.  Matches  the  vertical  segments. 

2.  Does  not  disagree  with  other  hypothesized  surfaces. 

3.  As  much  as  possible,  with  non-perfect  data,  incorpo¬ 
rates  the  observations  of  Lowe  and  Binford  [23]. 

This  surface  may  then  be  used  to  propagate  the  known 
disparities  at  reliable  contours  along  the  horizontal  con¬ 
tours,  and  along  areas  where  the  feature  extraction  and/or 
the  matching  was  incomplete.  In  addition  this  surface  serves 
as  a  hypothesis  of  a  dense  disparity  map  which  is  consistent 
with  the  observed  images. 

2.4.3  Edge  Analysis  and  Vertex  Detection 

Once  we  have  found  the  surfaces  making  up  the  faces  of 
the  objects  in  the  scene,  we  want  to  use  those  surfaces  to 
verify the edgel matching which was given a low weight
(such  as  those  along  horizontal  edges).  Also,  during  this 
sub-task,  we  label  the  edges  and  vertices.  To  do  this  the 
intersections  of  adjacent  object  faces  are  calculated.  Some 
adjustments  to  the  surfaces  may  be  necessary  to  preserve 
the  location  of  high-confidence  points  such  as  well  defined 
vertices.  Once  all  of  the  surfaces  have  been  found,  the  wire 
frame  is  re-analyzed  to  determine  the  edgel  matching  along 
horizontal  edges  in  the  scene,  which  could  not  be  given  a 
good  match  disparity  prior  to  the  surface  generation.  The 
contours  and  vertices  of  the  objects  in  the  scene  are  labeled 
for  use  by  higher-level  processes. 


2.4.4  Error  Analysis 

In  order  to  verify  the  accuracy  of  the  system,  and  to  help 
with  its  debugging,  we  plan  to  gather  scene  data  using  a 
new,  active  ranging  system,  developed  by  Jezouin  at  USC 
[28],  which  uses  a  pair  of  CCD  cameras  and  a  laser  to  find 
the  distance  from  the  left  camera  to  the  objects  in  the  scene. 
The  ranging  system  is  expected  to  give  ranges  with  the 
same  2-D  resolution  as  the  CCD  cameras.  In  addition,  the 
new  ranging  system  provides  a  camera  model  which  will  be 
used  to  register  the  images.  The  laser  is  equipped  with  a 
servo-  and  galvo-controlled  pair  of  mirrors  which  generates 
a  plane  of  light  which  may  be  swept  across  a  scene.  The 
intersection  of  the  light-plane  with  the  objects  in  the  scene 
is  reflected  as  a  thin  vertical  line  across  the  faces  of  the 
objects  in  the  scene. 

This  active  system  will  generate  a  ground-truth  for  the 
scene  that  will  be  used  during  the  error  analysis  to  provide 
a  quantitative  analysis  of  how  accurately  we  are  able  to 
determine  object  surfaces  from  the  passive  data  alone. 

3  Future  Research 

We  are  working  toward  a  passive  stereo  image-processing 
system  which  extracts  surfaces  from  matched  features.  In¬ 
variant  features  such  as  edges  and  corners  are  extracted 
from  monocular  images.  Edges  are  found  using  the  Nevatia- 
Babu  Linear  Feature  Extractor  [22]  and  matched  using  the 
Medioni  Segment  Matcher  [4].  The  resulting  segment 
matches  contain  matching  errors  and  the  extracted  features 
do  not  allow  for  the  trivial  extraction  of  disparity  from  the 
correct  matches.  This  system  will,  therefore,  post-process 
the  matched  segments  using  continuity  and  regional  context 
(such  as  the  gap-filling  described  in  this  paper)  to  detect 
and  correct  matching  errors.  Junctions,  found  by  analysis 
of  segment  joins  and  endpoints,  will  be  matched  using  the 
segment  matches  and  continuity  in  the  pair  of  images.  Hy¬ 
potheses  are  formed  about  missing  and  erroneous  lines  and 
junctions  which  will  be  acted  upon  if  verified  during  later 
processing. 

To  accurately  match  edgels  in  order  to  reduce  ranging 
error  as  well  as  remove  artificial  jaggedness  in  depth,  we 
feel  that  it  is  necessary  to  postulate  the  surfaces  of  the 
underlying  scene  and  therefore  produce  a  dense  description 
of  that  scene.  Therefore,  surfaces  will  be  fit  to  vertical 
lines.  This  last  stage  will  be  performed  in  two  steps,  first 
to  obtain  an  approximate  surface  sketch,  which  will  be  used 
to  determine  the  location  of  limb  edges  and  to  propagate 
disparity  information  across  the  horizontal  lines. 

The principal contributions that are or will be made by
this system are:

1.  The  use  of  stereo  matching  to  drive  monocular  image 
processing. 

2. A new classification of intensity junctions which provides
an additional monocular cue to junction validity.

3.  A  solution  to  the  contrast-reversal  problem. 

4.  A  dense  representation  using  passive  techniques  with¬ 
out  making  any  “flat-surface”  assumption. 

5. A 2½-D sketch with curved surface-patches fit
to an extended wire-frame model of the scene.

Progress  has  already  been  made  on  the  first  three  of  the 
above  goals.  In  addition,  in  order  to  demonstrate  our  asser¬ 
tion  that  an  accurate  3-D  representation  may  be  acquired 
through  passive  stereo  we  will  obtain  an  actively-acquired 
set  of  range  data  which  is  registered  with  the  passively- 
acquired  intensity  images.  This  data  will  be  used  to  gener¬ 
ate  a  quantitative  measure  of  the  system. 

4  Acknowledgement 

The  author  would  like  to  thank  Gerard  Medioni,  Ramakant 
Nevatia,  and  Keith  Price  for  their  support  and  comments 
on  this  work. 

References 

[1]  A.  R.  de  Saint  Vincent.  A  3D  perception  system  for 
the  mobile  robot  Hilare.  In  IEEE  International  Con¬ 
ference  on  Robotics  &  Automation,  pages  1105-1111, 
IEEE  Computer  Society,  April  7-10  1986.  San  Fran¬ 
cisco,  California. 

[2]  S.  Tsuji,  J.  Zheng,  and  M.  Asada.  Stereo  vision  of 
a  mobile  robot:  World  constraints  for  image  match¬ 
ing  and  interpretation.  In  IEEE  International  Con¬ 
ference  on  Robotics  &  Automation,  pages  1594-1599, 
IEEE  Computer  Society,  April  7-10  1986.  San  Fran¬ 
cisco,  California. 

[3] N. Ayache and B. Faverjon. A fast stereovision
matcher based on prediction and recursive verification
of hypothesis. In Proceedings of the Third Workshop
on Computer Vision: Representation and Control,
pages 27-37, IEEE Computer Society, October 1985.
Bellaire, Michigan.

[4]  G.  Medioni  and  R.  Nevatia.  Segment-based  stereo 
matching.  Computer  Vision,  Graphics,  and  Image 
Processing,  31:2-18,  1985. 

[5]  V.  Milenkovic  and  T.  Kanade.  Trinocular  vision  using 
photometric  and  edge  orientation  constraints.  In  Pro¬ 
ceedings  of  the  DARPA  Image  Understanding  Work¬ 
shop,  pages  163-175,  December  1985.  Miami  Beach, 
Florida. 


[6]  R.  Mohan.  Error  detection  and  correction  for  stereo. 
In  Proceedings  of  the  DARPA  Image  Understanding 
Workshop,  pages  433-442,  December  1985.  Miami 
Beach,  Florida. 

[7]  R.  D.  Arnold.  Automated  Stereo  Perception.  PhD  the¬ 
sis,  Stanford  University,  Stanford,  California,  March 
1983.  Technical  Report  STAN-CS-83-961. 

[8] W. Grimson. From Images to Surfaces: A Computational
Study of the Human Early Visual System. Artificial
Intelligence, The MIT Press, Cambridge, Massachusetts,
1981. Based on the author’s thesis (Ph.D. - MIT).

[9] S. Barnard and W. Thompson. Disparity analysis of
images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2(4):333-340, July 1980.

[10]  S.  Barnard  and  M.  Fischler.  Computational  stereo. 
ACM  Computing  Surveys,  14(4):553-572,  December 
1982. 

[11]  G.  Medioni  and  Y.  Yasumoto.  Corner  Detection  and 
Curve  Representation  Using  Cubic  B-Splines.  Techni¬ 
cal  Report  ISG  111,  University  of  Southern  California, 
September  1985. 

[12]  M.  Shah  and  R.  Jain.  Detecting  time-varying  cor¬ 
ners.  Computer  Vision,  Graphics,  and  Image  Process¬ 
ing,  28:345-355,  1984. 

[13]  D.  Langridge.  Curve  encoding  and  the  detection  of 
discontinuities.  Computer  Graphics  and  Image  Pro¬ 
cessing,  20:58-71,  1982. 

[14]  R.  Nevatia  and  A.  Huertas.  Building  Detection  in  Sim¬ 
ple  Scenes.  Final  Technical  Report  ISG  110,  Univer¬ 
sity  of  Southern  California,  September  1985. 

[15] G. Falk. Interpretation of imperfect line data as a
three-dimensional  scene.  Artificial  Intelligence,  3:101- 
144,  1972. 

[16]  Y.  Shirai.  Recognition  of  real-world  objects  using  edge 
cues.  In  A.  Hanson  and  E.  Riseman,  editors,  Computer 
Vision  Systems,  pages  353-362,  Academic  Press,  New 
York,  1978. 

[17]  Y.  Shirai.  A  context  sensitive  line  finder  for  recog¬ 
nition  of  polyhedra.  Artificial  Intelligence,  4:95-119, 
1973. 

[18]  R.  Shapira  and  H.  Freeman.  Computer  description  of 
bodies  bounded  by  quadric  surfaces  from  a  set  of  im¬ 
perfect  projections.  IEEE  Transactions  on  Computers, 
C-27(9):841-854, September 1978.




[19] R. Brooks. Symbolic Reasoning Among 3-D Models
and 2-D Images. PhD thesis, Stanford, Stanford, California,
June 1981. Report No. STAN-CS-81-861.

[20] M. Herman and T. Kanade. The 3D MOSAIC
Scene Understanding System: Incremental Reconstruction
of 3D Scenes from Complex Images. Technical
Report CMU-CS-84-102, Carnegie-Mellon University,
Pittsburgh, PA, February 1984.

[21]  R.  Ohlander,  K.  Price,  and  R.  Reddy.  Picture  segmen¬ 
tation  using  a  recursive  splitting  method.  Computer 
Graphics  and  Image  Processing,  8:313-333,  1978. 

[22]  R.  Nevatia  and  K.R.  Babu.  Linear  feature  extraction 
and  description.  Computer  Graphics  and  Image  Pro¬ 
cessing,  13:257-269,  June  1980. 

[23]  D.  Lowe  and  T.  Binford.  The  recovery  of  three- 
dimensional  structure  from  image  curves.  IEEE  Trans¬ 
actions  on  Pattern  Analysis  and  Machine  Intelligence, 
7:320-326,  May  1985. 

[24]  S.  Cochran.  Accurate  determination  of  three  dimen¬ 
sional  surfaces  from  passive  stereo.  Unpublished,  De¬ 
cember  1986.  Ph.D.  Dissertation  Proposal. 

[25]  A.  Huertas  and  G.  Medioni.  Edge  detection  with  sub¬ 
pixel  precision.  In  Proceedings  of  Computer  Vision  and 
Pattern  Recognition  Conference,  pages  633-636,  IEEE 
Computer  Society,  June  19-23  1985.  San  Francisco, 
California. 

[26]  J.  S.  Chen,  A.  Huertas,  and  G.  Medioni.  Very  fast 
convolution  with  laplacian-of-gaussian  masks.  In  Pro¬ 
ceedings  of  Computer  Vision  and  Pattern  Recognition 
Conference,  pages  293-298,  IEEE  Computer  Society, 
June  22-26  1986.  Miami  Beach,  Florida. 

[27] D. Panton, C. Grosch, D. DeGryse, J. Ozils, A.
LaBonte, S. Kaufmann, and L. Kirvida. Geometric
Reference  Studies.  Final  Technical  Report  RADC-TR- 
81-182,  December  1981.  Volume  44,  Number  12. 

[28]  G.  Medioni  and  J.  Jezouin.  An  implementation  of  an 
active  stereo  range  finder.  March  1987.  To  be  pub¬ 
lished  in  the  Proceedings  of  the  Optical  Society  of 
America. 




(c)  Blocks. 


Figure  1:  The  Original  Intensity  Images 




Figure 3: Possible Join Labels.


(a)  Right  Image. 


(b)  Left  Image. 


(a)  Right  Image. 


(b) Left Image.



Figure 4: A Partial Match Which Can Be Propagated.




Figure  5:  A  Bad  Match  Which  Can  be  Resolved. 


Figure 6: Junction Transition Graph. In addition to the “ALOS” type, the corresponding segments
must also approximately match.




Figure  7:  Search  for  Associated  Regions. 


Figure  8:  Search  for  Corner  Endpoints. 


[Table 1 body is not recoverable from the scan. Legible row labels: AAA, AAL, AAO, ALO, AOO, AOS, LLS, LOO, OOO.]

Table  1:  Junction  Transition  Distances  (Assuming  a  unit  transition  cost). 




Figure  9:  Segments  Near  Blob  Boundary. 


Figure  10:  Results  of  Applying  the  Gap-Filling. 


(b)  Aerial  View. 


(b)  Aerial  View. 


STEREO  MATCHING 

BY  HIERARCHICAL,  MICROCANONICAL  ANNEALING 

Stephen T. Barnard *


SRI  International  (Artificial  Intelligence  Center) 

333  Ravenswood  Avenue,  Menlo  Park,  California  94025 


Abstract 

An improved stochastic stereo-matching algorithm is presented.
It incorporates two substantial modifications to an earlier
version: (1) a new variation of simulated annealing that is
faster, simpler, and more controllable than the conventional
“heat-bath” version, and (2) a hierarchical,
coarse-to-fine-resolution control structure. The same Hamiltonian
is minimized, but far more efficiently. The basis of
microcanonical annealing is the Creutz algorithm. Unlike its
counterpart, the familiar Metropolis algorithm, the Creutz
algorithm simulates a thermally isolated system at equilibrium.
The hierarchical control structure, together with a Brownian
state-transition function, tracks ground states across scale,
beginning with small, coarsely coded levels. Results for a
512x512 pair with 50 pixels of disparity are shown.

1  INTRODUCTION 

Computational theories of vision often involve optimization,
usually in an enormous state space and over a non-convex
objective function. Recently, a subtle technique from statistical
physics has been used for such computations [1,2,3,4]. It is based
on the physical analogy of annealing a system of molecules to
its ground state, and hence is called simulated annealing [5,6].
To grow a perfect crystal, one starts at a high temperature and
then gradually cools the substance, staying as close to
equilibrium as practical. This is directly analogous to what happens
in simulated annealing: the Metropolis algorithm [7] is used to
bring a synthetic system to equilibrium, and the macroscopic
parameter temperature is used to control the rate of cooling.

The  general  theme  is  as  follows: 

• A particular (usually low-level) vision problem is considered.

• A representation is chosen in which the problem is modeled
as an analog to a physical system of discrete “molecules.”
Each molecule typically corresponds to a location on a pixel
lattice. The system has many degrees of freedom.

• An energy function (a Hamiltonian) is chosen that expresses
constraints inherent in the problem. The solution is
characterized by the ground states of the system; that is, those
states that have the lowest energy. The ground states are
difficult to specify because the state space of the system is
huge: exponential in the number of molecules.

• The dynamics are simulated in a way that brings the system
to the desired ground state, or at least to a very low-energy
state that approximates a ground state.

* Support for this work was provided by the Defense Advanced Research
Projects Agency under contracts DACA76-85-C-0004 and MDA903-86-C-0084.

This paper presents such a model for matching stereo images.
An early version is described in [3]. Two major extensions
have been made that permit substantially improved performance.
First, a new variety of simulated annealing is employed.
It uses a simpler alternative to the standard Metropolis
algorithm that is more efficient, more easily implemented, and
offers more control over the annealing process. Second, by
representing the stereo pair as a Laplacian pyramid, the system is
extended to operate over several levels of resolution. It exploits
relatively quickly computed minima at lower levels of resolution
to initialize its state at higher levels. This method leads both
to more efficiency and to the ability to deal with much larger
ranges of disparities.

The  basic  representation  remains  unchanged.  The  state  of  the 
system  encodes  a  dense  map  of  discrete  horizontal  disparities, 
defined  over  the  left  image,  which  specify  corresponding  points 
in  the  right  image.  In  the  improved  design,  this  state  is  relative 
to  a  particular  level  of  resolution. 

The energy of a lattice site is composed of two terms, each of
which expresses a constraint important in stereo matching:

$$E_{ij} = |I_L(i,j) - I_R(i, j + D(i,j))| + \lambda\,|\nabla D(i,j)|$$

$I_L$ and $I_R$ are the left and right image intensities, or some simple
function of them (such as a Laplacian). Subscripts $i$ and $j$ range
over all sites in the left image lattice. $D$ is the disparity map.
The first term is the absolute difference in intensity between
corresponding points (where the correspondences are defined by
the current state of the system), and the second is proportional
to the local spatial variation in the current state (the magnitude
of the gradient of the disparity map). The constant $\lambda$ is used
to balance the terms.¹ This equation expresses two competing
constraints: (1) the image intensities of corresponding points
should be more-or-less equal, and (2) the disparity map should
be more-or-less continuous.

¹ The performance of the system is not very sensitive to the value chosen
for $\lambda$. In the example of Section 4, as well as all three examples in [3],
$\lambda = 5$ was used.

The Hamiltonian,

$$E = \sum_{i,j} E_{ij},$$

is unchanged; therefore, the ground states that we seek to determine
or to approximate are the same. The new design is strictly
concerned with improved performance. The techniques used to
obtain it are not restricted to stereo matching, and should be
considered for any simulated-annealing approach to low-level
vision problems.
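
The site energy and its sum over the lattice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the clamping of correspondences that fall outside the right image, and the use of forward differences (|dx| + |dy|) in place of the gradient magnitude are choices made here, not details given in the paper.

```python
def site_energy(left, right, disp, i, j, lam=5):
    """Two-term energy of lattice site (i, j): photometric + lam * smoothness."""
    rows, cols = len(left), len(left[0])
    # Photometric term: intensity difference between corresponding points.
    # Clamping the right-image column is an assumption of this sketch.
    jr = min(max(j + disp[i][j], 0), cols - 1)
    photometric = abs(left[i][j] - right[i][jr])
    # Smoothness term: forward differences stand in for |grad D| (assumption).
    dx = disp[i][j + 1] - disp[i][j] if j + 1 < cols else 0
    dy = disp[i + 1][j] - disp[i][j] if i + 1 < rows else 0
    return photometric + lam * (abs(dx) + abs(dy))


def hamiltonian(left, right, disp, lam=5):
    """Total energy: the sum of the site energies over the left-image lattice."""
    return sum(site_energy(left, right, disp, i, j, lam)
               for i in range(len(left))
               for j in range(len(left[0])))


# Toy pair: the right image is the left image shifted by one pixel.
left = [[0, 10, 20, 30], [0, 10, 20, 30]]
right = [[99, 0, 10, 20], [99, 0, 10, 20]]
correct = [[1, 1, 1, 1], [1, 1, 1, 1]]
wrong = [[0, 0, 0, 0], [0, 0, 0, 0]]
```

On this toy pair the uniform disparity of one pixel has a much lower total energy than zero disparity, which is the behaviour the two competing constraints are meant to produce.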

2  MICROCANONICAL  ANNEALING 

The conventional simulated annealing algorithm uses an
adaptation of the Metropolis algorithm to bring a system to
equilibrium at decreasing temperatures. The Metropolis algorithm
defines a Markov process that generates a sequence of
states, such that the probability of occurrence of any particular
state is proportional to its Boltzmann weight:

$$P(S) \propto \exp(-\beta E(S)),$$

where $E(S)$ is the energy of state $S$ and $\beta$ is the inverse
temperature of the system. This process generates samples from
the canonical ensemble; that is, the system is considered to be
immersed in a heat bath with a controllable temperature.
Annealing is accomplished by imposing a schedule for reducing the
temperature, $1/\beta$, so as to keep the system close to equilibrium.

Creutz has described an alternative technique that simulates
the microcanonical ensemble [8]. In this method, the total energy
remains constant (for some fixed point in the schedule). Instead
of simulating a system immersed in a heat bath, the Creutz
algorithm simulates a thermally isolated system in which energy
is conserved. It performs a random walk through state space,
constraining states to a surface of constant energy. The simplest
way to accomplish this is to augment the representation with
an additional degree of freedom, called a demon, that carries a
variable amount of energy, $E_D$. The total energy of the system
is now:

$$E = E(S) + E_D.$$

Normally, the demon is constrained to have non-negative energy,
although the possibility of giving it negative energy is useful in
annealing, as will be discussed below.

In the Metropolis algorithm a potential new state $S'$ is chosen
randomly, and is accepted or rejected based on the change in
energy:

$$\Delta E = E(S') - E(S).$$

If $\Delta E$ is negative, the new state is accepted; otherwise, it is
accepted with probability $\exp(-\beta \Delta E)$.

The Creutz algorithm is quite similar. If $\Delta E$ is negative, the
new state is accepted, and the demon energy is increased
($E_D \leftarrow E_D - \Delta E$). If $\Delta E$ is non-negative, however,
acceptance of the new state is contingent upon $E_D$: if $\Delta E < E_D$
the change is accepted, and the demon energy is decreased
($E_D \leftarrow E_D - \Delta E$); otherwise, the new state is rejected.
Clearly, the total energy of the system remains constant.
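
The two acceptance rules can be placed side by side in a short Python sketch. The function names and the toy sequence of proposed energy changes are hypothetical; only the accept/reject logic follows the description above.

```python
import math
import random


def metropolis_accept(delta_e, beta, rng=random):
    """Heat-bath rule: uphill moves need exp() and a random number."""
    return delta_e < 0 or rng.random() < math.exp(-beta * delta_e)


def creutz_accept(delta_e, demon):
    """Microcanonical rule: returns (accepted, new_demon_energy).

    Downhill moves are always accepted and feed the demon; other moves
    are accepted only if the demon holds more than delta_e.
    """
    if delta_e < 0 or delta_e < demon:
        return True, demon - delta_e
    return False, demon


# Toy run: lattice energy plus demon energy stays constant throughout.
energy, demon = 100, 0
history = []
for delta in [-7, 3, 5, -2, 4]:   # hypothetical proposed energy changes
    accepted, demon = creutz_accept(delta, demon)
    if accepted:
        energy += delta
    history.append((accepted, energy, demon))
```

Note that the Creutz rule uses only a comparison and a subtraction, so with integer energies it needs no floating-point arithmetic at all.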

The Creutz algorithm has several advantages. Unlike the
Metropolis algorithm, it does not require the evaluation of
transcendental functions. Of course, in practice these functions can
be stored as tables, but we would like our algorithm to be
adaptable to fine-grained parallel processors such as the Connection
Machine. The small amount of local memory in such machines
makes lookup tables unattractive. The Creutz algorithm can
easily be implemented with only integer arithmetic, which is again a
significant advantage for fine-grained parallel processors and for
VLSI implementation. Experiments indicate that the Creutz
method can be programmed to run an order of magnitude faster
than the conventional Metropolis method for discrete systems
[9]. A further important advantage is that microcanonical
simulation does not require high quality random numbers.

In conventional simulated annealing we control the process by
specifying the temperature. In the microcanonical version,
however, temperature is not a control parameter; it is a statistical
feature of the system. In fact, standard arguments can be used
to show that at equilibrium the demon energies have a Boltzmann
distribution:

$$P(E_D) \propto \exp(-\beta E_D).$$

The inverse temperature can be determined from the mean value
of the demon energy:

$$\beta = \frac{\ln\left(1 + 4/\langle E_D \rangle\right)}{4}.$$
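
One way to see where a relation of this form comes from is the short derivation below. It assumes, as in Creutz's Ising-model setting [8], that the demon energy is restricted to non-negative multiples of a quantum $q$; the symbol $q$ is introduced here for illustration and does not appear in the text.

```latex
% Assumption: E_D takes the values 0, q, 2q, ... with Boltzmann weights.
\begin{align*}
\langle E_D \rangle
  &= \frac{\sum_{n \ge 0} q n \, e^{-\beta q n}}{\sum_{n \ge 0} e^{-\beta q n}}
   = \frac{q \, e^{-\beta q}}{1 - e^{-\beta q}}
   = \frac{q}{e^{\beta q} - 1}, \\
e^{\beta q} &= 1 + \frac{q}{\langle E_D \rangle}
  \quad\Longrightarrow\quad
  \beta = \frac{1}{q} \ln\!\left(1 + \frac{q}{\langle E_D \rangle}\right).
\end{align*}
```

With $q = 4$ this gives the inverse temperature as one quarter of $\ln(1 + 4/\langle E_D \rangle)$.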

Control of microcanonical annealing is accomplished by
periodically removing energy from the system. The method used to
generate the results in Section 4 is as follows:

1. Assume that the process begins in a random state $S_0$ of
high energy with respect to the ground state. Call this
$E_0 = E(S_0)$. The initial demon energy is zero.

2. Remove a fixed proportion of this energy, $\delta E$, by reducing
the demon energy ($E_D \leftarrow E_D - \delta E$). The results of Section
4 were obtained with $\delta E = E_0/300$.

3. Run the Creutz algorithm until the system reaches
equilibrium. A reasonable test for equilibrium is to consider the
rate of accepted moves to states of higher energy. If the
system is large enough, this rate will increase steadily until
it approximates the rate of moves to states of lower energy.
We terminate the process for a particular energy level when
the rate of accepted moves to higher energy states decreases
(measured over N site visits).

4. Repeat steps (2) and (3) until no further improvement is
observed.
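
The four steps above can be sketched on a toy one-dimensional labeling problem. Everything specific here is an assumption made for illustration: the toy energy, the uniform proposal, the equilibrium test "the uphill count in a batch of N visits did not increase", and the `patience` rule standing in for "no further improvement".

```python
import random


def energy(x, target, lam):
    """Toy Hamiltonian: data term plus lam times a first-difference term."""
    data = sum(abs(a - b) for a, b in zip(x, target))
    smooth = sum(abs(x[k + 1] - x[k]) for k in range(len(x) - 1))
    return data + lam * smooth


def local_delta(x, target, lam, k, new):
    """Energy change from setting x[k] = new (only nearby terms change)."""
    def local(v):
        e = abs(v - target[k])
        if k > 0:
            e += lam * abs(v - x[k - 1])
        if k + 1 < len(x):
            e += lam * abs(x[k + 1] - v)
        return e
    return local(new) - local(x[k])


def anneal(target, levels, lam=1, fraction=300, patience=5, seed=0):
    rng = random.Random(seed)
    n = len(target)
    x = [rng.randrange(levels) for _ in range(n)]   # step 1: random state
    e = energy(x, target, lam)
    e0, demon = e, 0
    d_e = max(1, e0 // fraction)                    # integer share of E_0
    best, stale, stages = e, 0, 0
    while stale < patience:                         # step 4 (assumed rule)
        demon -= d_e                                # step 2: remove energy
        stages += 1
        prev_uphill = -1
        while True:                                 # step 3: Creutz sweeps
            uphill = 0
            for _ in range(n):                      # one batch of N visits
                k = rng.randrange(n)                # random visiting order
                new = rng.randrange(levels)
                delta = local_delta(x, target, lam, k, new)
                if delta < 0 or delta < demon:
                    x[k] = new
                    e += delta
                    demon -= delta
                    if delta > 0:
                        uphill += 1
            if uphill <= prev_uphill:               # uphill rate stopped rising
                break
            prev_uphill = uphill
        if e < best:
            best, stale = e, 0
        else:
            stale += 1
    return x, e, e0, demon, stages, d_e
```

A call such as `anneal([0, 0, 0, 5, 5, 5, 2, 2, 2, 7, 7, 7], levels=8)` returns the final labeling together with the bookkeeping values, so the conservation of total energy between removals can be checked directly.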

This procedure has only one free parameter: the ratio $\delta E / E_0$.
While $E_D < 0$ the Creutz method operates as a “greedy”
algorithm, accepting only moves to lower energy states and
increasing $E_D$. When $E_D$ becomes positive (which happens quickly if
$\delta E$ is small), the demon begins to exchange energy between
lattice positions. As the ground state is approached, $E_D$ remains
negative because it cannot absorb more energy from the lattice.

If sites are visited in a regular scan, we observe an undesirable
effect: after $E_D$ is made negative, energy is removed from only a
small area of the lattice. For example, if we visit the sites along
scan lines from left-to-right, bottom-to-top, the algorithm will
make “greedy” moves on the lower part of the lattice, resulting
in significantly lower energy density in this area compared to the
rest of the lattice. Many more scans may be required for this
energy gradient to diffuse throughout the entire lattice. This
problem is easily overcome by visiting sites in random order.

The algorithm described above is sequential, and therefore is
not suitable for parallel processing. Each local state transition
can change $E_D$, which in turn can affect the next transition.
Fortunately, the technique can be modified to a parallel one
by using a separate demon for each lattice site. (As we add
demons, the technique moves toward a canonical-ensemble
simulation. In fact, if the number of demons is very large compared
to the number of sites, the technique specializes to the Metropolis
algorithm [8].) Preliminary experiments with one demon per
site indicate good performance, although the results of Section
4 were generated with the single-demon algorithm.

3  HIERARCHICAL  ANNEALING 


In the original model [3], annealing was performed only at
the level of resolution of the stereo images. In some cases the
images were first bandpass filtered to remove low-frequency
components; in effect, a simple photometric correction for
inconsistent sensor gains or film development. Lower and upper bounds
on disparity were specified in advance, and all state transitions
were considered with equal probability. As the range of disparity
became large, this scheme required much larger amounts of
computation. If we have $m$ permissible disparities and $N$ pixels,
the size of the state space is $m^N$. If we double the range of
disparity, the size of the state space increases by a factor of $2^N$.
Since $N$ is a rather large number ($2^{18}$ in the example shown in
Section 4), we see that the state space grows explosively with
increasing disparity range.

A natural extension of the method is to adopt a hierarchical,
coarse-to-fine control structure. At a coarse level of resolution
the number of lattice sites (i.e., the number of pixels) and the
range of disparity are small, and therefore the size of the state
space is relatively small.² We should be able to compute an
approximate ground state quickly, and then use it to initialize
the annealing process at the next, finer level of resolution.

² Although the state space may be large in absolute terms.

Coarse-to-fine techniques have been widely used for image
matching. For example, Moravec [10] used a resolution hierarchy
to match discrete, point-like features. The stereo model
described by Marr and Poggio [11] and further developed by
Grimson [12] matched zero-crossings between hierarchies of
bandpassed images. More recently, Witkin et al. [13] have used
a continuation method to minimize an energy function through
scale space. In each case, the results of low-resolution matching
were used to guide the system at higher resolutions.

The Laplacian pyramid, developed originally as a compact
image coding technique [14], offers an efficient representation
for hierarchical annealing. In a Laplacian pyramid, an image
is transformed into a sequence of bandpass filtered copies,
$I^0, I^1, I^2, \ldots, I^n$, each of which is smaller than its predecessor by
a factor of 1/2 in linear dimension (a factor of 1/4 in area), with
the center frequency of the passband reduced by one octave.
This transform can be computed efficiently by recursively applying
a small generating kernel to create a Gaussian (low-passed)
pyramid, and then differencing successive low-passed images to
construct the Laplacian pyramid.

After constructing Laplacian pyramids from the original
stereo images, disparity is reduced by a factor of 1/2 in successive
levels. Therefore, at some level, disparity is small everywhere.
For typical stereo images, we can take this to be level
$n - 4$. (For example, if the original images were a power of 2
in linear dimension, the Laplacian images at level $n - 4$ would
be 16x16 pixels. Disparities in the range of 0 to 63 pixels in a
pair of 512x512 images would be reduced to the range of 0 to 1
pixel, with truncation, at the $(n - 4)$th level.) We shall start
annealing at this level, find an approximate ground state, and then
expand the solution to the next level. To make this coarse-to-fine
strategy work, however, we need two further modifications.
We must use a different process for generating state transitions,
and we must specify how a low-resolution result is used to start
the annealing process at the next higher level.

In the original model, the probability of choosing a new
disparity for consideration as a new state was uniformly distributed
over the prior range of disparities; that is, each of the $m$
permissible disparities was proposed with probability $1/m$.

This is not compatible with our intention of guiding the process
with lower-resolution results, however. We are now assuming
the disparity of a lattice site to be close to its correct value. A
more effective generating process is to restrict the disparities to
increase or decrease by one pixel:

$$P(D(i,j) \to d) = \begin{cases} 1/2 & \text{if } |d - D(i,j)| = 1 \\ 0 & \text{otherwise} \end{cases}$$

with the further restriction that the disparities are not allowed
to specify corresponding points outside the boundary of the
right image. In this scheme the system undergoes Brownian
motion through state space. An additional feature of this
state-transition function is that it is no longer necessary to specify
bounds on disparity in advance.
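
The difference between the two proposal rules can be sketched as follows. The function names are hypothetical, and the handling of the image boundary (dropping any candidate that would point outside the right image, rather than proposing it and rejecting it) is an assumption of this sketch.

```python
import random


def propose_uniform(d_min, d_max, rng=random):
    """Original model: any permissible disparity, with equal probability."""
    return rng.randint(d_min, d_max)


def propose_brownian(d, j, width, rng=random):
    """Step the disparity at column j by +1 or -1.

    Candidates whose corresponding point j + d falls outside the right
    image (columns 0 .. width - 1) are dropped (assumption of this sketch).
    """
    candidates = [c for c in (d - 1, d + 1) if 0 <= j + c < width]
    if not candidates:
        return d          # nowhere to move; keep the current disparity
    return rng.choice(candidates)
```

Because `propose_brownian` takes no disparity bounds, it reflects the remark above that the range no longer has to be fixed in advance; only the image width constrains the walk.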

Expanding a low-resolution result to the next level is a little
tricky. Obviously, one should begin by simply doubling the size
of the low-resolution lattice, and also doubling the disparity
values. Having done this, however, the new state has an artificially
low energy because every odd disparity value is “unoccupied,”
and the new map is therefore more uniform than it should be.
A spurious symmetry is imposed on the new state that is solely
due to the quantization of the previous result, and that is likely
to place the system near a metastable state (a local minimum)
from which it cannot recover. Fortunately, there is an easy
solution to this problem: break this symmetry by adding heat. One
effective way is as follows:


1. Compute the energy $E_k$ of the initial state at level $k$. (This
state has been determined by doubling the result at level
$k+1$.)

2. Add a fixed proportion of this energy, $\Delta E_k$, by increasing
the demon energy ($E_D \leftarrow E_D + \Delta E_k$). The results of
Section 4 were obtained with $\Delta E_k = E_k/10$.

3. Run the Creutz algorithm until the system reaches
equilibrium.

4. Repeat steps (2) and (3) until the number of accepted
higher-energy states exceeds the number of rejected
higher-energy states.
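
The expansion and reheating steps can be sketched as below. The names are hypothetical, and `run_to_equilibrium` is an assumed callback standing in for the Creutz sweeps: it takes the current demon energy and returns the updated demon energy with the counts of accepted and rejected higher-energy moves. The `max_rounds` guard is also an addition of this sketch.

```python
def expand_disparity(coarse):
    """Double the lattice in each dimension and double every disparity.

    Every value in the result is even, which is the spurious symmetry
    that the reheating loop is meant to break.
    """
    fine = []
    for row in coarse:
        doubled = [2 * d for d in row for _ in range(2)]
        fine.append(doubled)
        fine.append(list(doubled))
    return fine


def reheat(level_energy, run_to_equilibrium, demon=0, proportion=10,
           max_rounds=100):
    """Add level_energy / proportion to the demon until uphill moves win."""
    d_e = level_energy // proportion          # step 2 increment (integer)
    for _ in range(max_rounds):               # guard is an assumption
        demon += d_e                          # step 2: add heat
        demon, accepted_up, rejected_up = run_to_equilibrium(demon)  # step 3
        if accepted_up > rejected_up:         # step 4: stopping rule
            break
    return demon


def stub_equilibrium(demon):
    """Stand-in for the Creutz sweeps, for demonstration only."""
    return demon, demon // 10, 2
```

With the stub above and a level energy of 100, three rounds of heating are needed before accepted uphill moves outnumber rejected ones.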

4  RESULTS 


Figures 1(a) and (b) are an aerial stereo pair (512x512 pixels
each) covering part of Martin Marietta’s test site for the
Autonomous Land Vehicle Project near Denver. The terrain
is dominated by a long, steep “hogback” (running diagonally
across the middle of the images) that separates two broad
valleys. The highest terrain is in the upper left. Several roads and
a few buildings may be seen. Disparity ranges from zero to 50
pixels. Figures 1(c) and (d) show the Laplacian pyramids.

The results of applying the microcanonical, hierarchical
annealing algorithm to this data are illustrated in Figure 2(a),
which shows the sequence of approximate ground states found
at each level of the pyramid. Disparity values are displayed in
15 greylevels with wraparound. At the coarsest level of resolution
(the tiny 16x16 images), the system converged to a uniform
disparity map. As the system descends through the resolution
hierarchy, the detail of the approximate ground states becomes
increasingly precise. The disparities along the right border of
the map are incorrect because the left image does not overlap
the right in this region, and pixels in this region therefore have
no corresponding points. Figure 2(b) is a contour map covering
the upper left portion of the stereo pair, prepared by the
Engineering Topographic Laboratory, and is included for
comparison.

Figures  3  and  4  show  some  of  the  results  in  more  detail.  In 
Figure  3(a)  a  few  points  and  columns  have  been  selected  from 
a  magnified  portion  of  the  left  image.  The  corresponding  points 
in  the  right  image  are  shown  in  Figure  3(b).  Figure  4  is  similar. 

The largest problem solved by the previous version of the
system was on the order of a 256x256 pair with 15 levels of disparity.
This required about 12 hours on a Symbolics 3600. The result
shown here is a 512x512 pair with 50 levels of disparity, and was
computed in about the same amount of time. Consider the size
of the state space of the two problems. The state space of the
current problem³ exceeds the size of the old problem by a factor
of more than $10^{36000}$.
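
The comparison can be checked with a few lines of arithmetic, counting states as $m^N$ for $m$ disparity levels and $N$ pixels as in Section 3. The function name is hypothetical, and logarithms are used because the numbers themselves are far too large to represent directly.

```python
import math


def log10_states(width, height, levels):
    """log10 of the state-space size levels ** (width * height)."""
    return width * height * math.log10(levels)


old = log10_states(256, 256, 15)    # previous system's largest problem
new = log10_states(512, 512, 50)    # problem reported here
exponent_gap = new - old            # log10 of the ratio of the two sizes
```

Under this counting the exponent gap comes out on the order of $10^5$, comfortably above the bound stated in the text.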

5  CONCLUSIONS 

Two major improvements to a stochastic stereo-matching
system have been described: (1) a simulation of the microcanonical
ensemble, as opposed to the canonical ensemble of conventional
simulated annealing; and (2) the extension to a hierarchical
control structure based on Laplacian pyramids. Both techniques
are rather general in nature, and can probably be used in other
simulated-annealing applications. Together, they permit
solutions of stereo-matching problems far beyond the competence of
the original system.

We are currently evaluating the quantitative increase in
performance attributable to each new technique. Qualitatively, the
use of microcanonical annealing yields perhaps an order of
magnitude increase in efficiency. A potentially more important
benefit is that it is a simpler computation than standard annealing,
and is therefore more readily implemented in fine-grained
parallel systems. The hierarchical control structure, combined with
the Brownian state-transition function, contributes most of the
increased performance.


References

[1] S. Geman and D. Geman, Stochastic relaxation, Gibbs
distributions, and Bayesian restoration of images, IEEE
Transactions on Pattern Analysis and Machine Intelligence,
vol. PAMI-6, no. 6, November 1984, pp. 721-741.

[2] P. Carnevali, L. Coletti, and S. Patarnello, Image processing
by simulated annealing, IBM Journal of Research and
Development, vol. 29, no. 6, November 1985, pp. 569-579.

[3] S. Barnard, A stochastic approach to stereo vision, Proc.
5th Nat. Conf. Artificial Intell. (Philadelphia, Aug.
1986), vol. 1, American Association of Artificial Intelligence,
Menlo Park, Calif., pp. 676-680.

[4] J. L. Marroquin, Probabilistic solution of inverse problems,
Technical Report 860, MIT Artificial Intelligence Laboratory,
September 1985.

[5] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, Optimization
by simulated annealing, Science, vol. 220, no. 4598, May
13, 1983, pp. 671-680.

[6] S. Kirkpatrick and R.H. Swendsen, Statistical mechanics
and disordered systems, Comm. ACM, vol. 28, no. 4,
April 1985, pp. 363-373.

[7] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H.
Teller, and E. Teller, Equations of state calculations by fast
computing machines, J. Chem. Phys., vol. 21, no. 6, June
1953, pp. 1087-1092.

[8] M. Creutz, Microcanonical Monte Carlo simulation, Physical
Review Letters, vol. 50, no. 19, May 9, 1983, pp.
1411-1414.

[9] G. Bhanot, M. Creutz, and H. Neuberger, Microcanonical
simulation of Ising systems, Nuclear Physics, B235[FS11],
1984, pp. 417-434.

[10] H. Moravec, Rover visual obstacle avoidance, Proc. 7th
Int. Jt. Conf. Artificial Intell. (Vancouver, Canada, Aug.
1981), vol. 2, American Association of Artificial Intelligence,
Menlo Park, Calif., pp. 785-790.

[11] D. Marr and T. Poggio, A theory of human stereo vision,
Memo 451, Artificial Intelligence Laboratory, Massachusetts
Institute of Technology, Cambridge, Mass., Nov.
1977.

[12] W. E. L. Grimson, From Images to Surfaces, M.I.T. Press,
Cambridge, Mass., 1981.

[13] A. Witkin, D. Terzopoulos, and M. Kass, Signal matching
through scale space, Proc. 5th Nat. Conf. Artificial Intell.
(Philadelphia, Aug. 1986), vol. 1, American Association of
Artificial Intelligence, Menlo Park, Calif., pp. 714-719.

[14] P. Burt, The Laplacian pyramid as a compact image code,
IEEE Transactions on Communications, vol. COM-31, no.
4, April 1983, pp. 532-540.

³ Considering the range of disparity to be fixed at 50 pixels.




Figure 2: Approximate ground states. (a) Pyramid of approximate ground states. (b) Contour plot covering upper left portion.

Figure 3: Detail A. (a) Left image detail. (b) Corresponding right image detail.

Figure 4: Detail B. (a) Left image detail. (b) Corresponding right image detail.




THE  FORMATION  OF  PARTIAL  3D  MODELS  FROM  2D  PROJECTIONS 
- AN APPLICATION OF ALGEBRAIC REASONING*


D. Cyrluk, D. Kapur, J.L. Mundy, and V. Nguyen


General  Electric  Corporate  Research  and  Development 
P.O. Box 8,
Schenectady, New York 12301


ABSTRACT 

An  algebraic  model  for  three-dimensional  objects  is  intro¬ 
duced.  Algebraic  deduction  with  the  Grobner  basis  is  used 
to  verify  consistency  between  two  object  views.  It  is  demon¬ 
strated  that  a  useful  model  can  be  extracted  from  a  single 
viewpoint  using  a  new  approach  to  scene  labeling. 

1.  INTRODUCTION 
Model-Based  Recognition 

Considerable  research  effort  has  been  applied  to  the  use 
of  explicit  three-dimensional  models  for  object  recognition 
[Besl and Jain]. These models have been frequently derived
using conventional CAD tools [Roberts, Brooks, Lowe]. It is
also possible to derive three-dimensional models from ideal
line drawings [Markowsky and Wesley, Strat].

Models  can  also  be  derived  from  range  data  [Faugeras 
and  Hebert,  Grimson  and  Lozano-Perez,  Connolly  et  al]  or 
from stereo data [Faugeras et al., Mayhew]. In these cases,
the  models  have  been  used  to  recognize  objects  as  imaged  in 
the  same  fashion.  That  is,  a  model  of  planar  surfaces 
derived  from  range  data  is  matched  to  planar  surfaces 
derived  from  another  range  view  of  the  object.  Likewise, 
edge  segments  derived  from  stereo  data  are  matched  to 
edges  derived  from  another  stereo  view  of  the  object. 

Constraints  From  a  Single  Projection 

The  focus  of  this  paper  is  the  derivation  of  a  partial 
model  from  a  single  two-dimensional  perspective  view  of  an 
object  and  the  use  of  this  model  to  recognize  the  object  from 
another  viewpoint.  It  is  not  possible  to  derive  a  complete, 
three-dimensional  model  from  a  single  view  without  assum¬ 
ing  a  great  deal  about  the  relationships  between  object  sur¬ 
faces  and  edges.  For  example,  a  common  assumption  is  that 
the  object  is  a  polyhedron  with  adjacent  faces  and  edges  per¬ 
pendicular.  With  this  assumption,  it  is  possible  to  establish 
the  three-dimensional  structure  of  the  visible  surfaces  from  a 
single  view  [Herman]. 

The  investigation  here  centers  on  the  case  where  the  only 
assumption  made  about  the  object  is  that  it  is  a  polyhedron. 
The  initial  experiments  have  assumed  that  the  perspective 
image transformation can be approximated by an affine
transformation. An affine transformation is an orthographic
projection with subsequent scaling of coordinate axes. The
affine approximation is more realistic than the usual orthographic
assumption, but is more constraining than the full
perspective case.

*This work was supported in part by the DARPA Strategic Computing Vision Program
in conjunction with the Army Engineer Topographic Laboratories under Contract
No. DACA76-86-C-0007.

A polyhedron is a collection of three kinds of elements: the vertex,
the edge, and the face. A vertex is a three-dimensional
point  and  represents  the  intersection  of  two  or  more  edges  or 
three  or  more  faces.  An  edge  is  bounded  by  two  vertices 
and  is  the  intersection  of  at  least  two  faces.  A  face  is  a  planar 
region  bounded  by  a  sequence  of  edges  and  vertices.  The 
affine  projection  of  a  polyhedron  produces  a  set  of  two- 
dimensional  edges  and  vertices.  The  image  plane  is  parti¬ 
tioned  into  a  set  of  closed  regions,  which  are  bounded  either 
by  projections  of  edges  and  vertices  or  the  image  border. 

The  least  restrictive  constraint  that  can  be  derived  from 
the  projection  is  simply  that  the  image  vertices  are  related  to 
the  corresponding  object  vertices  by  the  linear  transformation 
equations  where  the  three-dimensional  coordinates  of  each 
vertex  are  unknown,  along  with  the  six  affine  transformation 
parameters.  These  constraints  are  not  sufficient  to  usefully 
define  the  three-dimensional  structure  of  the  object.  It  is 
necessary  to  introduce  a  grouping  of  vertices,  edges  and  faces 
of  the  projection  to  provide  significant  constraints  on  the 
three-dimensional  object  configuration. 

One  example  of  such  a  group  is  to  identify  a  cycle  of 
edges  and  vertices  in  the  projection  image  that  corresponds 
to  a  visible  face  of  the  three-dimensional  polyhedron.  The 
identification  of  such  groups  requires  that  a  labeling  opera¬ 
tion  be  performed  on  the  projection  [Malik,  Kapur  et  al, 
Nguyen].  The  fact  that  the  cycle  elements  must  be  coplanar 
can  then  be  used  to  further  constrain  the  three-dimensional 
structure.  Excellent  reviews  on  the  nature  of  these  projec¬ 
tion  and  coplanarity  constraints  are  available  [Sugihara,  Cra- 
po]. 

Recognition  as  Algebraic  Consistency 

Once  the  set  of  constraints  is  established  from  a  given 
view of the object, they can be used as a model for recognition.
In the work described here, the model constraints are
expressed  as  a  set  of  algebraic  equations  in  terms  of  un¬ 
known  three-dimensional  coordinates  and  transformation  pa¬ 
rameters.  If  we  consider  another  two-dimensional  projection, 
which  is  hypothesized  to  be  another  view  of  the  object,  then 
the  equations  derived  from  each  projection  should  be  alge¬ 
braically  consistent;  of  course,  the  correct  assignment  has  to 
be  made  between  the  corresponding  elements  of  each  view. 




2. IMAGES OF POLYHEDRA


The  use  of  feasible  constraints  for  object  recognition  is 
motivated by the ACRONYM system [Brooks]. In the case
of  ACRONYM,  the  consistency  between  a  known  three- 
dimensional  model  and  its  projection  in  an  image  was  tested 
using  a  form  of  the  SUP-INF  method  of  reasoning  about 
linear  inequalities.  The  inequalities  arise  because  of  toler¬ 
ances  on  the  expected  transformation  and  image  feature  posi¬ 
tion.  By  contrast,  in  the  work  reported  here,  we  do  not  have 
a  three-dimensional  model,  but  only  the  constraints  imposed 
by  a  two-dimensional  view.  Another  major  difference  is  that 
the  decision  procedure  used  in  the  current  work  permits  non¬ 
linear  constraint  equations,  but  not  inequations  (inequalities). 
Actually,  Brooks  could  handle  a  limited  form  of  nonlinearity 
associated  with  the  trigonometric  forms  involved  in  rotation. 

The  process  of  recognition,  in  the  current  context,  is  to 
first  form  assignments  between  vertex  groups  in  each  projec¬ 
tion  and  then  to  test  the  algebraic  consistency  of  the  resulting 
equations.  The  consistency  is  tested  by  determining  if  the 
ideal  of  the  set  of  equations,  taken  as  polynomials  equal  to 
zero, contains the unity element. Briefly, the ideal is the set of
all linear combinations, with polynomial coefficients, of polynomials
from the original set. If the ideal contains the unity
element, then the set of equations is inconsistent. These concepts
will be discussed in detail later, but it can be seen that a
unit element in the ideal can generate any polynomial as a
linear combination. This condition is similar to that arising
from an inconsistent set of axioms in logic, where any statement
can be proven by reductio-ad-absurdum. The presence
of unity in the ideal can be determined using a system of
term rewriting rules, called the Grobner basis [Kapur]. The
Grobner basis is generated from the original set of polynomials
using a completion procedure involving the interaction
of pairs of polynomials, expressed as rewrite rules.
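
As a small illustration (an example chosen here for exposition, not one of the experiments reported in this paper), consider the two polynomials xy - 1 and x. Setting both to zero is impossible, since x = 0 forces xy = 0. The same conclusion follows algebraically, because a linear combination with polynomial coefficients yields the unity element:

\[ y \cdot (x) + (-1) \cdot (xy - 1) = 1 \]

Hence 1 lies in the ideal generated by xy - 1 and x, and the two equations are inconsistent.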

If  the  equations  are  consistent,  then  the  two  views 
correspond  to  the  same  three-dimensional  object;  or  at  least 
they  share  some  solutions  for  the  possible  objects  that  are 
consistent  with  both  projections.  The  consistency  also  allows 
a  more  specific  determination  of  the  three-dimensional 
configuration  of  the  object.  For  example,  if  the  transforma¬ 
tion  between  the  views  is  given  in  advance,  then  the 
correspondence  between  equations  is  similar  to  the  feature 
matching  done  in  classical  stereo  analysis.  The  matching  be¬ 
tween  vertices  provides  a  disparity  or  depth  value  for  each 
vertex  match. 

If  the  transformation  between  views  is  unknown,  then 
the  depths  cannot  be  determined,  but  only  constrained  by 
the assignment between a vertex from each view. It is reasonable
to refer to this case as algebraic stereo; the object surface
depth is not explicitly determined but the variety (set of
equation  zeroes)  of  the  equations  represents  a  space  of  possi¬ 
ble  object  surfaces.  In  any  case,  the  introduction  of  new  con¬ 
straints,  either  from  hypotheses  about  geometric  constraints 
on  groups  of  projection  elements,  or  from  new  views  of  the 
object,  will  reduce  the  number  of  unknown  coordinate 
values. 

The  remainder  of  the  paper  will  give  details  about  the 
constraints,  the  analysis  of  consistency  and  some  examples  of 
experiments. 


The  Viewing  Transformation 

The  situation  of  interest  to  us  here  is  illustrated  in  Figure 
2.1.  A  polyhedral  object  is  being  viewed  from  two  or  more 
viewpoints.  To  simplify  the  discussion  we  assume  that  the 
object coordinate frame is the same as that of the first viewpoint.
There is a (possibly unknown) coordinate transformation
between the two viewpoint locations. The three-dimensional
geometry and topological structure of the polyhedron
are not known. All that is available are the two-dimensional
projections of the object in each viewplane, π
and π'. The goal is to determine if the two projections actually
correspond to the same polyhedron.

First  let  us  consider  the  transformation  between  the 
three-dimensional  vertex  locations  and  their  corresponding 
two-dimensional  position  in  the  image  plane.  The  natural 
transformation  that  results  from  a  standard  imaging  system  is 
the  perspective  transformation.  However,  it  is  not  essential 
to  use  this  exact  relationship  in  most  practical  situations.  It 
can  be  shown  that  an  approximation  to  perspective,  the  affine 
transformation,  is  quite  appropriate  for  most  viewing  situa¬ 
tions  [Thompson  and  Mundy]. 

The  form  of  the  affine  transformation  is  given  by  the  fol¬ 
lowing  matrix  relation: 

\[ p = w\,[I_2][R]\,P + p_0 \]

where w is an affine scale factor and R represents a rotation
matrix for the orientation of the object reference frame relative
to the image reference frame. The matrix [I_2] indicates
the projection from three dimensions into the two-dimensional
image plane,

\[ I_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]

p is the location, in the image plane, of the projection of the
vertex position vector P. The two-dimensional vector p_0 represents
a translation in the image plane.

The rotation matrix, R, can be represented as the product
of three matrices,

\[ R = [R_z][R_y][R_x] \]

which represent rotations about each of the view plane coordinate
axes. For example,

\[ R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & \sin\phi \\ 0 & -\sin\phi & \cos\phi \end{bmatrix} \]

where φ is the angle of rotation about the x axis. The complete
transformation thus requires six parameters: three rotations,
two translation components and a scale factor.
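
The affine projection can be sketched numerically as follows. This is only a minimal sketch: the function names, the composition order R = Rz Ry Rx, and the sign conventions chosen for the rotations about y and z are assumptions made for illustration; only the form of Rx is taken from the text.

```python
import math


def rot_x(phi):
    """Rotation about the x axis, in the form given for Rx in the text."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]


def rot_y(theta):
    """Rotation about the y axis (sign convention is an assumption)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]


def rot_z(psi):
    """Rotation about the z axis (sign convention is an assumption)."""
    c, s = math.cos(psi), math.sin(psi)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def affine_project(P, w, angles, p0):
    """Apply p = w [I2][R] P + p0 to one 3D vertex P.

    angles = (phi, theta, psi) are rotations about x, y and z;
    p0 is the two-component image translation.
    """
    phi, theta, psi = angles
    R = matmul(rot_z(psi), matmul(rot_y(theta), rot_x(phi)))
    rotated = [sum(R[i][k] * P[k] for k in range(3)) for i in range(3)]
    # [I2] keeps the x and y components and discards the depth.
    return (w * rotated[0] + p0[0], w * rotated[1] + p0[1])
```

With zero angles, unit scale and zero translation, `affine_project` simply returns the x and y components of the vertex, which is the situation assumed for the first view.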

Without  loss  of  generality,  we  assume  that  the  projection 
of the object into the first viewplane involves no rotation,
no translation, and a unit scale factor. The only unknowns are
the  depths  (z  component)  of  the  three-dimensional  object 
vertices.  The  x,y  components  of  vertex  P  are  just  the  vertex 
projection  in  image  coordinates. 

The  affine  viewing  projection  and  the  transformation  be¬ 
tween  views  are  combined  into  a  single  transformation  be¬ 
tween  corresponding  image  vertex  locations  in  the  two  views. 
That is,

\[ p' = w\,[I_2][R]\,P + p'_0 \]

where p' is the location of the projection of vertex P in the
second view. In this case, R represents the rotation between
image reference frames. The scale factor, w, arises from
differences in image depths between views. p'_0 is the component
of the translation vector between viewpoints which
lies in the second viewplane, π'.
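
Written out by components, the two-view relation gives two scalar equations per vertex. The expansion below assumes the notation r_{ij} for the entries of R, (x, y, z) for the components of P, and (x'_0, y'_0) for the components of the image translation; these names are introduced here only for illustration.

\[ x' = w\,(r_{11} x + r_{12} y + r_{13} z) + x'_0 \]
\[ y' = w\,(r_{21} x + r_{22} y + r_{23} z) + y'_0 \]

The third row of [I2][R] is zero, so no third equation arises. Products such as w r_{13} z, in which the scale factor, a rotation entry and the depth may all be unknown, are what make the constraint equations nonlinear.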

It  should  be  noted  that  our  procedure  assumes  the  loca¬ 
tion  of  projected  image  vertices  to  be  exact.  This  assumption 
is  not  appropriate  when  dealing  with  actual  intensity  image 
data.  Feature  locations  can  be  at  least  several  pixels  in  error 
for  actual  images.  Ultimately,  this  problem  must  be  solved, 
but  the  main  focus  of  this  paper  is  to  explore  the  formal 
properties  of  geometric  and  topological  constraints  in  image 
projections. 

The  Topology  of  Projections 

In  addition  to  the  geometric  relationships  just  discussed, 
it  will  be  important  to  consider  the  relationship  between  the 
topology  of  the  three-dimensional  polyhedral  surface  and  the 
topology  of  the  corresponding  projection.  We  observe  that  a 
polyhedral  image  is  a  segmentation  of  the  image  plane  into 
open  and  connected  regions,  joined  by  vertices  and  edges. 
We call these regions faces of the image, and define a
vertex-edge-face topology in the image space similar to the
one for 3D polyhedra. Using point set topology [Henle],
the vertices, edges, and faces of a polyhedral image are defined as
follows: 

vertex  -  A  vertex  is  a  point  of  the  image  plane  which  is  the  inter¬ 
section  of  at  least  two  edges  of  the  image.  The  angular  sectors 
between  consecutive  edges  around  the  vertex  must  be  part  of  dis¬ 
tinct  faces  of  the  image.  So  a  vertex  is  not  only  a  connection  of 
distinct  edges,  but  also  a  junction  of  distinct  faces  of  the  po¬ 
lyhedral  image. 

edge  -  An  edge  is  a  segment  of  the  image  plane  which  connects 
two  distinct  faces  of  the  image.  The  edge  is  closed  iff  its  two  end 
points  are  both  vertices  of  the  image.  The  edge  is  open  iff  only 
one  of  its  end  points  is  a  vertex  of  the  image.  In  this  case,  the 
edge  must  be  cut  off  by  the  image  boundary. 

cycle  -  A  cycle  is  a  closed  chain  of  vertices  and  edges.  A  cycle  C 
is  an  out-cycle  (resp.  in-cycle)  of  a  face  F,  iff  the  direction  of  the 
cycle  C  is  clockwise  (resp.  counter-clockwise),  as  we  walk  along 
C with the face F on the right-hand side. Since every edge of the cycle
is bounded by two vertices in the cycle, the cycle is a closed set
of the image plane.

face  -  A  face  is  a  connected  subset  of  the  image  plane  locally 
bounded  by  chains  of  edges,  called  a  boundary.  A  face  is  closed 
iff its boundary includes one out-cycle and zero or more in-cycles. A
face  is  open  iff  its  boundary  is  not  closed,  or  it  is  only  bounded 
from  the  interior  by  in-cycles.  In  this  case,  the  face  must  be  cut 
off  by  the  image  border. 
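
The out-cycle and in-cycle distinction can be illustrated with the signed area of a cycle. The sketch below assumes that each cycle is listed in the direction that keeps its face on the right-hand side, that coordinates use a y-up axis convention, and that the function names are hypothetical; with image coordinates whose y axis points down, the sign test would be reversed.

```python
def signed_area(cycle):
    """Shoelace formula; positive when the vertices run counter-clockwise (y up)."""
    total = 0.0
    n = len(cycle)
    for i in range(n):
        x0, y0 = cycle[i]
        x1, y1 = cycle[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def cycle_kind(cycle):
    """Classify a closed chain of vertices, walked with the face on the right."""
    area = signed_area(cycle)
    if area == 0:
        raise ValueError("degenerate cycle: zero enclosed area")
    # Clockwise traversal (negative area) is an out-cycle of the face.
    return "out-cycle" if area < 0 else "in-cycle"
```

A clockwise walk with the face on the right keeps the face inside the cycle, so the cycle bounds the face from outside; a counter-clockwise walk keeps the face outside, so the cycle bounds a hole in the face.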

With  the  point  set  topology  in  the  image  space  defined 
above,  the  set  union  of  all  topological  sets  (vertices,  edges, 
and  faces)  of  the  image  is  equal  to  the  open  set  of  the  image 
plane  strictly  inside  the  image  boundary.  Note  that  no  point 
of  the  image  is  strictly  inside  two  distinct  topological  sets.  So 
mathematically,  the  segmentation  of  the  image  plane  can  be 
represented  by  a  planar  network  of  nodes.  The  nodes  are 
not  only  the  vertices  as  in  [Huffman,  Clowes,  Waltz],  but 
also  the  edges  and  faces  of  the  image.  The  polyhedral  image 


is  represented  more  explicitly  by  a  planar  network  of  nodes 
linked  between  themselves  by  adjacency  links.  This  planar 
network  leads  to  a  parallel  implementation  of  image  labeling. 

For  now,  our  polyhedral  world  excludes  Origami  objects 
that  have  folds  of  planar  surfaces.  So,  the  junctions  in  the 
image  come  either  from  occlusion,  or  from  projection  of  3D 
polyhedral  vertices.  We  use  Huffman-Clowes  junctions  for 
occlusion  and  trihedral  vertices  [Huffman,  Clowes],  and 
Malik’s  Fundamental  Junction  Equation  for  multihedral  ver¬ 
tices  [Malik].  However,  junctions  at  vertices  are  not  the  only 
possible  local  constraints  in  labeling.  We  introduce  two  other 
labeling  constraints  around  edges  and  face  boundaries  of  the 
image,  described  respectively  by  junction-pairs,  and 
junction-loops.  These  extensions  of  the  labeling  system  pro¬ 
vide  a  unified  body  of  constraints  over  the  vertex,  edge,  and 
face  components  of  a  projection. 

Since  a  face  is  an  open  and  connected  region,  its  bounda¬ 
ries  must  be  disjoint.  We  can  label  the  boundaries  indepen¬ 
dently  of  each  other.  The  labeling  constraints  of  these 
boundaries  are  represented  by  n-tuples  of  junctions,  called 
junction-loops.  A  junction-loop  is  a  consistent  labeling  of  the 
chain or cycle of vertices and edges. Previously, the labeling
consistency over arbitrary open chains of vertices and edges
was captured by Waltz’s filtering process [Waltz]. Waltz and
others were forced to do a depth-first search for global labelings
to capture the labeling consistency over boundary cycles. We
must also do a depth-first search to enumerate all the
junction-loops for a face boundary; however, our depth-first
search is local to a face, rather than global to the whole image.

Similarly,  a  junction-pair  is  a  consistent  local  labeling  of 
an  edge.  The  junction-pair  is  a  more  complete  description  of 
the  edge  than  the  edge-label.  Junctions,  junction-pairs,  and 
junction-loops are local labelings of a node consistent with itself
and all its adjacent neighbors (Figure 3.1). Just as faces
joined  by  vertices  and  edges  completely  represent  the  image 
segmentation,  the  junctions,  junction-pairs,  and  junction- 
loops  completely  describe  the  local  labeling  constraints  be¬ 
tween  the  nodes  in  the  image. 

Nodes  also  have  labels.  For  example,  vertices  are  labeled 
V,  Y,  A,  T,  or  M,  respectively,  for  L,  Fork,  Arrow,  T,  or 
Multihedral  junctions.  Edges  are  labeled  +,  -,  or  >,  respec¬ 
tively,  for  convex,  concave,  or  occluding.  Faces  are  labeled 
B,  V,  or  O,  respectively,  for  background,  visible,  or  partially 
occluding  face.  Labels  and  local  labelings  describe  all  the  la¬ 
beling  constraints  at  a  node,  and  are  organized  into  hierar¬ 
chies. 

Labels  and  labelings  at  a  node  are  related  by  the  subset 
relation  between  hierarchical  classes.  The  hierarchies  and  the 
subset  relation  lead  to  formal  definitions  of  constraint  satis¬ 
faction  between  two  adjacent  nodes.  The  image  topology 
leads  to  a  uniform  constraint  propagation  over  all  vertex- 
edge,  edge-face,  and  face-vertex  links.  Labeling  of  a  po¬ 
lyhedral  image  is  equivalent  to  constraint  satisfaction  and 
propagation  (CSP)  over  all  nodes  in  the  image.  Each  node 
has  local  constraints  described  by  labels  and  labelings.  Each 
node  only  interacts  with  its  adjacent  neighbors.  The  result  of 
CSP  is  local  consistencies  or  inconsistencies  at  all  the  nodes 
in  the  network. 
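
The flavor of constraint satisfaction and propagation over adjacent nodes can be sketched with a generic filtering loop. This is a toy sketch under stated assumptions, not the labeling system described here: the encoding of local labelings as plain strings, the `compatible` predicate, and the three-node network in the usage note are all hypothetical.

```python
def propagate(candidates, neighbors, compatible):
    """Discard local labelings that no labeling of an adjacent node supports.

    candidates: dict mapping node -> set of candidate labelings (hypothetical encoding)
    neighbors:  dict mapping node -> list of adjacent nodes
    compatible: function (node_a, labeling_a, node_b, labeling_b) -> bool
    Returns the filtered dict; an empty set marks a locally inconsistent node.
    """
    current = {node: set(labels) for node, labels in candidates.items()}
    queue = list(current)
    while queue:
        node = queue.pop()
        for other in neighbors[node]:
            # Keep only the labelings of `other` supported by some labeling of `node`.
            supported = {
                lab for lab in current[other]
                if any(compatible(other, lab, node, mine) for mine in current[node])
            }
            if supported != current[other]:
                current[other] = supported
                queue.append(other)  # neighbors of `other` must be rechecked
    return current
```

For example, take an edge node "e" between two vertex nodes "v1" and "v2", where each vertex labeling is summarized by the label it implies for the edge. With candidates {"+", "-"} at v1, {"+", "-", ">"} at e, and {"+", ">"} at v2, and compatibility defined as equality of labels, the loop leaves only "+" at every node.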

Nguyen proved by induction that globally consistent labelings
exist if and only if all the nodes in the network are locally
consistent. The induction proof confirms that the labels
and  labelings  at  a  node  are  necessary  and  sufficient  local  con¬ 
straints  for  the  labeling  of  an  image.  The  output  is  a  net¬ 
work  of  nodes,  with  labels  and  local  labelings  attached  to 
each  node.  This  represents  all  locally/globally  consistent  la¬ 
belings  of  the  polyhedral  image. 

Figure  3.2  traces  the  parallel  labeling  of  two  blocks,  one 
on  top  of  the  other.  The  parallel  labeling  starts  with  the  input 
image,  and  default  labels  for  all  the  nodes  (i.e.  vertices, 
edges,  and  faces)  in  the  image,  frame  1.  Then,  it  finds  all  lo¬ 
cal  labeling  constraints  at  all  vertices,  face  boundaries,  and 
edges  of  the  image,  described  respectively  by  junctions, 
junction-loops,  and  junction-pairs,  frames  2  to  4.  Constraint 
satisfaction  and  propagation  is  done  uniformly  at  all  the 
nodes  in  the  image,  from  every  node  to  its  neighbors, 
through  vertex-edge,  edge-face,  and  face-vertex  links. 
Frames  5  and  6  describe  the  result  of  CSP.  Attached  to  each 
node  is  the  number  of  local  labelings. 

The  face  surrounding  the  two  blocks  has  16  local  label¬ 
ings,  corresponding  to  the  blocks  floating  in  air,  or  resting 
against  some  imaginary  surface  at  some  of  its  bounding 
edges,  frame  5.  The  two  interpretations  correspond  to  the 
face  labeled  as  image  background  (B)  or  as  polyhedral  face 
(F). Note that the blocks sitting on top of a horizontal surface
can be thought of as the blocks floating in air, and
infinitesimally  touching  the  surface.  The  surrounding  face  is 
a  touchable  background,  and  is  labeled  B.  The  detection  of 
the  background  face  surrounding  the  blocks  further  restricts 
the  labeling  of  the  blocks,  without  changing  their  shape  in¬ 
terpretations,  frames  7  and  8. 

Labeling  identifies  significant  groups  of  vertices  which  can 
be  used  to  derive  algebraic  constraints.  For  example,  all  the 
vertices  in  a  visible  face  (labeled  V)  are  related  by  coplanarity 
equations,  frame  8  of  Figure  3.2.  Locally,  vertices  connected 
by  convex  or  concave  edges  (labeled  respectively  +,  -)  are 
on  the  planes  of  the  adjacent  faces.  A  vertex  at  a  T  occlusion 
only  belongs  to  the  occluding  face.  Occlusion  gives  an  ine¬ 
quality  relating  the  depths  of  the  vertices  on  the  same  line  of 
sight, but lying on different faces of the 3D scene [Sugihara].
Currently,  inequalities  are  not  handled  and  we  rely  only  on 
coplanarity  and  projection  constraints.  The  next  section  will 
describe  the  algebraic  form  of  these  constraints  in  detail. 

3.  ALGEBRAIC  CONSTRAINTS 

As  indicated  above,  we  view  an  object  model  to  be  a  sys¬ 
tem  of  algebraic  equations.  These  equations  come  from  vari¬ 
ous  sources: 

1.  The  affine  transformation  of  vertices. 

2.  The  constraints  on  vertex  groups. 

One such vertex group is the face. The constraint
associated with the face is the coplanarity of the face's
vertices.
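
When the depths of a face's vertices are known, coplanarity can be checked with a determinant test. The sketch below is one possible formulation, assumed here for illustration: it uses exact rational arithmetic, hypothetical function names, and requires that the first three vertices are not collinear.

```python
from fractions import Fraction


def det3(a, b, c):
    """Determinant of the 3x3 matrix whose rows are a, b and c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def coplanarity_residuals(vertices):
    """One residual per vertex beyond the first three.

    All residuals are zero iff the vertices are coplanar, provided the
    first three vertices are not collinear.
    """
    pts = [tuple(Fraction(c) for c in v) for v in vertices]
    p0 = pts[0]
    u = tuple(a - b for a, b in zip(pts[1], p0))
    v = tuple(a - b for a, b in zip(pts[2], p0))
    return [det3(u, v, tuple(a - b for a, b in zip(p, p0))) for p in pts[3:]]


def is_coplanar(vertices):
    """True when every residual vanishes exactly."""
    return all(r == 0 for r in coplanarity_residuals(vertices))
```

Each residual is a polynomial in the vertex coordinates, so with unknown depths the same expressions could serve as constraint equations rather than as a numerical test.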

The  above  constraints  correspond  to  constraints  derived 
from  a  single  image.  Additionally,  we  want  to  determine 
whether  an  assignment  of  vertices  between  vertex  groups  of 
two  images  is  consistent.  These  constraints  are  introduced 
by  equating  vertex  variables  between  the  two  images. 

The  basic  idea  behind  our  approach  is  to  use  these  con¬ 
straints  to  form  maximally  consistent  assignment  sets.  A 
maximally  consistent  assignment  set  is  a  consistent  set  of 
vertex assignments such that the addition of any further
vertex assignment would cause inconsistency. The set of
equations  associated  with  a  maximally  consistent  assignment 
set  forms  a  model  for  the  object. 

A  set  of  constraints  may  be  consistent  simply  because 
there  are  too  many  degrees  of  freedom.  Such  a  set  of  con¬ 
straints  is  of  little  use  in  further  refining  a  model.  Thus, 
when  we  use  a  vertex  group  to  constrain  a  set  of  assign¬ 
ments,  we  want  to  make  sure  that  its  constraints  are 
sufficiently  strong  to  either  cause  inconsistency,  or,  in  case  of 
a  consistent  assignment,  to  allow  us  to  significantly  refine  the 
model.  Thus  it  is  natural  for  us  to  ask  to  what  extent  does  a 
vertex  group  refine  our  3D  model  and  allow  us  to  detect  in¬ 
consistency.  We  will  pursue  these  questions  in  this  section. 

We  start  with  the  affine  transformation  equations  for  a 
vertex  group  consisting  of  n  vertices.  These  equations  were 
presented  earlier.  As  indicated  there  are  two  equations  per 
vertex,  one  each  for  x  and  y.  Thus,  if  we  are  trying  to  deter¬ 
mine  the  consistency  of  two  images,  we  will  have  a  total  of 
4n  equations,  2n  for  each  image.  Of  course  these  equations 
are  not  necessarily  independent  of  each  other.  As  for  the 
unknowns,  there  are  six  parameters  for  each  transformation, 
the  three  angles,  the  two  translation  parameters,  and  the 
scaling  factor.  Additionally,  the  x,  y,  and  z  components  of 
each  object  vertex  are  unknown,  giving  an  additional  3n  un¬ 
knowns.  Thus  if  we  just  consider  the  equations  generated 
from  the  affine  transformation,  we  have  4n  equations,  and 
3n + 12 unknowns. If, with no loss of generality, we assume
that the object coordinate frame is the same as the coordinate
frame for one of the images (that is, its transformation
is simply the identity), then we have 2n equations and n + 6
unknowns.  Thus  with  no  other  constraints  we  need  at  least 
six  vertices  in  a  vertex  group  to  detect  an  inconsistent  assign¬ 
ment. 

Now suppose that the vertex group is additionally constrained
to lie in a plane. The equation describing coplanarity
is: ax + by + cz + d = 0, where a, b, c, and d are new
variables which determine a plane, and (x, y, z) is a vertex in
the  object.  For  each  point  in  the  vertex  group  there  will  be 
one  plane  equation.  Thus  for  a  vertex  group  representing  a 
face  with  n  vertices  there  will  be  3n  equations  and  n  +  10  un¬ 
knowns.  Thus  with  no  other  constraints  we  would  need  at 
least  five  vertices  in  a  vertex  group  to  detect  an  inconsistent 
assignment. 
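
The counting argument of the last two paragraphs can be reproduced with a few lines of arithmetic. The sketch assumes, as the text's figures suggest, that an inconsistency becomes detectable once the number of equations is at least the number of unknowns; the function names are hypothetical.

```python
def counts(n, coplanar=False):
    """Return (equations, unknowns) for an n-vertex group seen in two views,
    with the first view taken as the object coordinate frame."""
    equations = 2 * n   # one x and one y equation per vertex in the second view
    unknowns = n + 6    # one depth per vertex plus six transformation parameters
    if coplanar:
        equations += n  # one plane equation per vertex
        unknowns += 4   # the plane variables a, b, c, d
    return equations, unknowns


def min_vertices(coplanar=False):
    """Smallest n for which the equations are at least as many as the unknowns."""
    n = 1
    while True:
        equations, unknowns = counts(n, coplanar)
        if equations >= unknowns:
            return n
        n += 1
```

Under this criterion `min_vertices()` gives six vertices for the projection equations alone and `min_vertices(coplanar=True)` gives five for a planar face, matching the figures quoted in the text.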

Note that a similar analysis can be found in [Sugihara],
who  also  represents  constraints  as  sets  of  equations.  A  major 
difference  between  our  approaches  is  the  manner  in  which 
under-constrained  equations  are  handled.  Sugihara  deals 
with  the  problem  by  introducing  cost  functions  for  intensity 
and  texture  properties,  and  then  uses  a  minimization  pro¬ 
cedure  to  select  from  the  set  of  under-constrained  solutions  a 
“best” solution. In our approach we try to infer as much as
possible  from  the  available  constraints  by  solving  the  set  of 
equations  symbolically.  An  advantage  of  Sugihara’s  method 
is  that  he  can  deal  with  inequalities  and,  to  some  extent,  im¬ 
precise  data.  A  main  advantage  of  our  approach  is  that  a  set 
of  algebraic  equations  can  naturally  represent  a  set  of  multi¬ 
ple  solutions  to  an  under-constrained  problem;  i.e.,  we  are 
not  forced  to  select  a  specific  solution. 

We  now  describe  our  symbolic  approach  to  testing  con¬ 
sistency. 




4.  TESTING  CONSISTENCY 

Given a set of equations that describe the affine projection,
topological properties, and geometric properties for a
vertex group, we must be able to solve the following problems:

1.  Is  an  assignment  of  vertices  in  two  projections  con¬ 
sistent? 

2.  Given  a  consistent  assignment,  what  can  we  conclude 
about  the  transformation  parameters  and  depth  of  the 
object? 

3.  Given  two  vertex  groups  with  consistent  assignments, 
can  they  be  merged  in  a  consistent  manner? 

The  approach  to  these  problems  is  the  same  approach 
used in [Kapur] to prove geometry theorems. The main idea
is  that  a  set  of  assignments  is  consistent  if  and  only  if  the 
equations  describing  the  assignments  have  a  common  zero. 
The  remaining  problems  can  be  solved  as  a  by-product  of 
testing  the  consistency  of  a  set  of  equations. 

Algebraic  Foundation 

As  indicated  above,  the  main  problem  is  to  determine 
whether a set of equations has a common zero. In this section
we give a precise formal description of this problem.

Given the field $Q$ of rationals, let $Q[x_1, \ldots, x_n]$ be the set
of all polynomials with variables $x_1, \ldots, x_n$ and rational
coefficients. We will be considering solutions of equations in
$Q[x_1, \ldots, x_n]$ belonging to the set of complex numbers $C$.

Definitions: Let $p \in Q[x_1, \ldots, x_n]$. The equation $p = 0$ is
consistent iff there exist $v_1, \ldots, v_n \in C$ such that
$p(v_1, \ldots, v_n)$, the result of substituting $v_1, \ldots, v_n$ for
$x_1, \ldots, x_n$, respectively, in $p$, evaluates to 0. $(v_1, \ldots, v_n)$ is a
zero of $p$ in $C^n$. The equation $p = 0$ is inconsistent otherwise.
Given a set $S \subseteq Q[x_1, \ldots, x_n]$, $S$ is consistent iff there exist
$v_1, \ldots, v_n \in C$ such that for every $p \in S$, $p(v_1, \ldots, v_n)$
evaluates to 0. The set $S$ is inconsistent otherwise.

Given two images, I1 and I2, let S1 be the set of equations
generated from I1, and S2 be the set of equations generated
from I2. Then the problem of testing whether a set of
vertex assignments, VA, is consistent is almost equivalent to
the problem of testing whether the set of equations
T = S1 ∪ S2 ∪ VA is consistent.

The two problems are not exactly identical since a set of
polynomial equations is consistent even if the only common
zeros are complex, whereas in this case the set of vertex assignments
would not be physically realizable. There are other
approaches that avoid this problem (see for example [Arnon]),
but they are considerably more computationally expensive
than our approach.

Hilbert’s  Nullstellensatz 

Hilbert's Nullstellensatz [van der Waerden] allows us to
determine whether a set of equations is consistent.

Let $p \in Q[x_1, \ldots, x_n]$ be a polynomial that evaluates to zero at all
the common zeros of $p_1, \ldots, p_j$ in the n-dimensional affine space of $C$.
Then $p^q = h_1 p_1 + h_2 p_2 + \cdots + h_j p_j$ for some natural number $q$ and
polynomials $h_1, h_2, \ldots, h_j \in Q[x_1, \ldots, x_n]$.

A special case of the above theorem is the following important
corollary: If $p_1, \ldots, p_j$ have no common zeros in the
n-dimensional affine space of $C$, then 1 belongs to the ideal
generated by $p_1, \ldots, p_j$.


Thus the problem of testing whether a set of vertex
assignments is consistent is reduced to the problem of testing
whether the ideal generated by a set of polynomials contains
1. Our method of solving this problem uses the Grobner
basis computation described next.

The  Grobner  Basis 

An  important  problem  in  ideal  theory  is  the  ideal 
membership  problem.  Given  a  set  of  polynomials,  S,  let  (S) 
denote  the  ideal  generated  by  S.  The  set  S  is  called  a  basis  of 
the  ideal  (S).  Given  a  polynomial,  p,  and  a  basis  S,  the  ideal 
membership problem is to determine whether $p \in (S)$.
[Buchberger]  invented  the  Grobner  basis  to  solve  this  and 
other  problems  in  polynomial  ideal  theory.  Intuitively  a 
Grobner  basis,  G,  is  a  set  of  polynomials  that  have  the  prop¬ 
erty  that  when  used  as  rewrite  rules  they  will  reduce  any  po¬ 
lynomial  in  (G)  to  0. 

Polynomials  as  Rewrite  Rules 

The  main  idea  behind  the  Grobner  basis  is  to  view  a  set 
of  polynomials  as  a  rewrite  system.  A  Grobner  basis  is  then  a 
finitely  terminating,  confluent  rewrite  system.  Before 
proceeding  further  we  give  some  general  background  on 
rewrite  systems. 

A rewrite rule is a pair $(l, r)$, usually written $l \to r$. $l$ is
called the left hand side, $r$ the right hand side. A rewrite
rule can rewrite an "object" if some part of the object
matches the left hand side of the rule. For example the
string rewriting rule $a \to b$ can rewrite the string ab to bb.

A rewrite system is a collection of rewrite rules. If an object
cannot be rewritten any further then the object is said to be
in normal form. We use the notation $s \to_R t$ to mean that $s$
can be rewritten to $t$ in one step using rewrite system $R$.
$s \to_R^* t$ means that $s$ can be rewritten in arbitrarily many
steps to $t$.

Two important properties of rewrite systems are finite
termination and confluence. Definition: $\to$ is finitely terminating
(sometimes called noetherian) iff there does not exist an
infinite sequence $t_1 \to t_2 \to \cdots$. For example, the rewrite
system:

$a \to b$

$b \to a$

is not finitely terminating since the string a can be rewritten
indefinitely.

Definition: $\to$ is confluent iff

$\forall\, t_1, t_2, t_3\; (t_1 \to^* t_2 \,\land\, t_1 \to^* t_3) \Rightarrow \exists\, t_4\; (t_2 \to^* t_4 \,\land\, t_3 \to^* t_4)$.

Intuitively,  this  means  that  if  an  object  can  be  rewritten  in 
two  different  ways  to  different  objects  then  the  two  different 
objects  can  eventually  be  rewritten  back  to  a  common  object. 
If  a  rewrite  system  is  both  finitely  terminating  and  confluent 
then  every  object  has  a  unique  normal  form.  For  example, 
the  rewrite  system: 

$a \to b$

$ab \to b$

is not confluent since the string ab has two normal forms,
namely bb and b. By adding the rule $bb \to b$, the above
rewrite system can be made confluent.
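
The string rewriting example above can be checked mechanically. The following is a minimal sketch, not part of the original paper: the helper names `rewrite_once` and `normal_forms` are hypothetical, and the search assumes the rewrite system is finitely terminating. It enumerates every normal form reachable from a string, so a system that yields more than one normal form for some string is not confluent.

```python
def rewrite_once(s, rules):
    """Return every string obtainable from s by one rule application."""
    results = set()
    for lhs, rhs in rules:
        start = s.find(lhs)
        while start != -1:
            results.add(s[:start] + rhs + s[start + len(lhs):])
            start = s.find(lhs, start + 1)
    return results


def normal_forms(s, rules):
    """Collect all normal forms reachable from s (assumes termination)."""
    seen, frontier, normals = {s}, [s], set()
    while frontier:
        current = frontier.pop()
        successors = rewrite_once(current, rules)
        if not successors:
            normals.add(current)  # no rule applies: current is in normal form
        for nxt in successors - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return normals


rules = [("a", "b"), ("ab", "b")]
print(sorted(normal_forms("ab", rules)))                  # two normal forms
print(sorted(normal_forms("ab", rules + [("bb", "b")])))  # a single normal form
```

With the two original rules the string ab reaches both b and bb; once the rule for bb is added, only b remains.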

We now show how a set of polynomials can be viewed as
a rewrite system, and how from a finite set of polynomials
we can generate a finitely terminating, confluent rewrite
system, i.e. a Grobner basis. We first show how a polynomial
can be viewed as a rewrite rule.




Given a polynomial ring, $Q[x_1, \ldots, x_n]$, with indeterminates
$x_1, \ldots, x_n$ and coefficients from $Q$, we define a term
as a sequence of indeterminates or 1. A monomial is a term
with a coefficient from $Q$. The simplified sum-of-products
form of a polynomial is a sequence of monomials all with
distinct terms. By using associativity, commutativity, and
distributivity any polynomial can be put into simplified
sum-of-products form.

For example, given the polynomial $(x_1 x_2 + x_2 + x_2)(x_1)$,
its simplified sum-of-products form is $x_1 x_1 x_2 + 2 x_1 x_2$. This
polynomial will usually be written as $x_1^2 x_2 + 2 x_1 x_2$.

We want to turn a set of polynomials into a finitely
terminating rewrite system. The general method of ensuring
that a rewrite system is finitely terminating is to determine a
well founded ordering, $>$, on polynomials and to make sure
that for every rewrite rule $l \to r$, $l > r$. Thus in order to use
polynomials as rewrite rules we must first define a well
founded ordering on polynomials.

We start by first defining an ordering on the indeterminates:
$x_1 < x_2 < \cdots < x_n$. This ordering induces a total
ordering on terms in many different ways, two of which are

1. Total degree ordering

In this ordering we define for a term, $t$, $\deg(t)$ = the
number of indeterminates in $t$. For example,
$\deg(x_2^2 x_3) = 3$, and $\deg(1) = 0$. Terms are compared in
the following way:

i. if $\deg(t_1) < \deg(t_2)$ then $t_1 < t_2$

ii. if $\deg(t_1) = \deg(t_2)$ then compare $t_1$ and $t_2$ using
lexicographic ordering.

For example $x_2 < x_1^2$ since $\deg(x_2) < \deg(x_1^2)$.

2. Lexicographic ordering

In this ordering terms are only compared lexicographically;
the degree of a term is not taken into account.
For example $x_1^2 < x_2$, since $x_1^2$ is lexicographically
smaller than $x_2$.

The  ordering  that  we  choose  to  use  depends  on  the  appli¬ 
cation  we  have  in  mind.  In  general,  the  degree  ordering  is 
preferable  if  we  are  just  trying  to  detect  contradiction,  while 
the  lexicographic  ordering  is  preferable  if  we  want  to  solve  a 
system  of  equations.  This  will  be  discussed  further  later. 
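
The two term orderings can be expressed as sort keys. The sketch below is illustrative only: representing a term by its tuple of exponents, and the names `lex_key` and `degree_key`, are assumptions introduced here, not notation from the paper. It assumes the indeterminate ordering $x_1 < x_2 < \cdots < x_n$, so the exponent of the largest indeterminate is compared first.

```python
def lex_key(term):
    """Lexicographic key for x1 < x2 < ... < xn.

    A term is a tuple of exponents (e1, ..., en); reversing it makes the
    exponent of the largest indeterminate the most significant entry.
    """
    return tuple(reversed(term))


def degree_key(term):
    """Total degree key: compare degrees first, then break ties lexicographically."""
    return (sum(term), lex_key(term))


x2 = (0, 1)          # the term x2 in the indeterminates (x1, x2)
x1_squared = (2, 0)  # the term x1^2

# Under total degree ordering the lower-degree term x2 is smaller.
print(degree_key(x2) < degree_key(x1_squared))
# Under lexicographic ordering x1^2 is smaller, because x1 < x2.
print(lex_key(x1_squared) < lex_key(x2))
```

Passing either function as the `key` argument of `sorted` orders a list of terms under the corresponding ordering, which shows how the same pair of terms can compare differently under the two choices.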

In any case both orderings can be extended to monomials
and then to polynomials. If $m_1 = c_1 t_1$ and $m_2 = c_2 t_2$, where
$c_1$ and $c_2$ are coefficients and $t_1$ and $t_2$ are terms, then
$m_1 < m_2$ iff $t_1 < t_2$.

Let $p_1$ and $p_2$ be two polynomials such that
$p_1 = m_1 + m_2 + \cdots + m_k$ and $p_2 = h_1 + h_2 + \cdots + h_l$, both in
simplified sum-of-products form, with $m_1 > m_2 > \cdots > m_k$
and $h_1 > h_2 > \cdots > h_l$. Then $p_1 > p_2$ iff
$\exists\, i \le l\; (m_i > h_i \,\land\, \forall\, j < i\; (m_j = h_j))$.
It is easy to see that this totally orders polynomials.

In polynomial $p_1$ above, $m_1$ is called the head monomial,
and $p_1$ can be transformed to the following rewrite rule:
$m_1 \to -m_2 - \cdots - m_k$. This rewrite rule is finitely terminating
since under our ordering $m_1$ is greater than the polynomial
$-m_2 - \cdots - m_k$. Note also that by dividing the entire
polynomial by $c_1$ we can always form rewrite rules whose left hand
side's coefficient is 1.


Now that we have a way of ensuring finite termination,
we want to be able to ensure confluence. Intuitively, the way
confluence is ensured is to look at the polynomial rewrite
system and try to determine the possible ways that it could
be used to rewrite a polynomial in more than one way. This
is done by taking two rules and from their left hand sides
forming a new term that can be rewritten by both rules. For
example, given the following two rules:

1. $x_1 x_2 \to r_1$ ($r_1$ is this rule's right hand side, whatever it
may be.)

2. $x_2 x_3 \to r_2$

if we take the least common multiple (lcm) of the two left
hand sides we get $x_1 x_2 x_3$, which can be rewritten by both
rules. Using rule 1 it rewrites to $x_3 \cdot r_1$. Using rule 2 it
rewrites to $x_1 \cdot r_2$. In order to ensure confluence it must be
the case that the polynomial, $x_3 \cdot r_1 - x_1 \cdot r_2$, reduces to 0.
This  polynomial  is  called  an  s-polynomial  of  rule  1  and  rule  2, 
and  we  say  that  rule  1  and  rule  2  overlap.  If  with  the  current 
set  of  rules  this  polynomial  does  not  reduce  to  0  then  we  can 
force  it  to  reduce  to  0  by  transforming  its  reduced  form  to  a 
rewrite  rule  and  adding  it  to  the  current  system.  This  is  the 
basic  step  of  the  Grobner  basis  computation,  and  is  repeated 
until  all  s-polynomials  reduce  to  0.  Additionally,  it  is  useful 
to  use  the  new  polynomial  to  reduce  the  existing  polynomi¬ 
als,  hopefully  causing  previous  polynomials  to  reduce  to  0. 
Note  that  the  polynomials  that  can  be  reduced  using  the  new 
rule  have  to  be  removed  from  the  current  rule  set  and 
formed  into  new  rules. 
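
The overlap of two left hand sides can be computed directly from their exponents. The following sketch is an illustration under stated assumptions: terms are exponent tuples over $(x_1, x_2, x_3)$, and the helper names `lcm_term` and `cofactor` are hypothetical. It reproduces the overlap of the left hand sides $x_1 x_2$ and $x_2 x_3$.

```python
def lcm_term(t, u):
    """Least common multiple of two terms given as exponent tuples."""
    return tuple(max(a, b) for a, b in zip(t, u))


def cofactor(multiple, t):
    """The term that t must be multiplied by to obtain `multiple`."""
    return tuple(a - b for a, b in zip(multiple, t))


lhs1 = (1, 1, 0)  # x1 x2, left hand side of rule 1
lhs2 = (0, 1, 1)  # x2 x3, left hand side of rule 2

overlap = lcm_term(lhs1, lhs2)
print(overlap)                  # exponents of x1 x2 x3
print(cofactor(overlap, lhs1))  # exponents of x3, the multiplier of r1
print(cofactor(overlap, lhs2))  # exponents of x1, the multiplier of r2
```

The two cofactors are exactly the multipliers that appear in the s-polynomial $x_3 \cdot r_1 - x_1 \cdot r_2$.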

Here  is  an  outline  of  the  Grobner  basis  algorithm: 

Input: F, a finite set of polynomials.

Output: G, a Grobner basis, such that (G) = (F).

G := ∅; pairset := ∅; newpolys := F
loop until 1 ∈ G or (pairset = ∅ and newpolys = ∅)
    loop while newpolys ≠ ∅
        p := a polynomial from newpolys
        newpolys := newpolys - {p}
        G := G ∪ {p}
        add to pairset pairs of polynomials with p and
            polynomials in G
    (p1, p2) := a pair removed from pairset
    sp := normal form of s-polynomial(p1, p2)
    remove from G any polynomials that are reducible by
        sp, remove their corresponding pairs from
        pairset, and add these polynomials, together
        with sp if sp ≠ 0, to newpolys
return G

Buchberger showed the correctness and termination of
this algorithm [Buchberger].

We  will  illustrate  the  algorithm  with  a  few  examples. 


Example  1: 

Let $F = \{x^2 y - y^2,\; x y^2 - 3\}$, with $x > y$. Turning these
equations into rewrite rules using total degree ordering yields:

1. $x^2 y \to y^2$

2. $x y^2 \to 3$

Rules 1 and 2 overlap on $x^2 y^2$ and yield the following s-polynomial:

$y \cdot (y^2) - x \cdot (3) = y^3 - 3x$

Thus yielding rule:

3. $y^3 \to 3x$

Rules 2 and 3 overlap on $x y^3$ in the following manner:

$y \cdot (3) - x \cdot (3x) = 3y - 3x^2$

Thus yielding rule:

4. $x^2 \to y$

This new rule simplifies rule 1 to 0, thus deleting it. Rules 2
and 4 overlap on $x^2 y^2$:

$y^2 \cdot (y) - x \cdot (3) = y^3 - 3x$

This results in the polynomial $y^3 - 3x$, which can be reduced
to 0 by rule 3. Rules 3 and 4 do not have to be overlapped
since their left hand sides do not have any variables in common,
and the resulting s-polynomial would trivially be reducible to 0
using rules 3 and 4.

Since  there  are  no  pairs  that  we  have  not  considered, 
rules  2,  3,  and  4  form  a  Grobner  basis  for  F. 

Example  2: 

This  example  illustrates  how  the  Grobner  basis  can  be 
used  to  detect  inconsistency: 

Let $F = \{x^2 + x - 12,\; xy - 12,\; y^2 - 4\}$ with $x > y$. Turning
these equations into rewrite rules using total degree ordering
yields:

1. $x^2 \to -x + 12$

2. $xy \to 12$

3. $y^2 \to 4$

Rules 1 and 2 overlap on $x^2 y$:

$y \cdot (-x + 12) - x \cdot (12) = -xy + 12y - 12x$

After reduction this results in the rule:

4. $x \to y - 1$

Which can be used along with rule 3 to reduce rule 2 to:

5. $y \to -8$

Rule 5 then reduces rule 3 to:

6. $1 \to 0$

and we have a contradiction, showing that the original set of
polynomials does not have a common zero.
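
The inconsistency test of this example can be reproduced with a short sketch. It is not the authors' implementation: it uses the basic textbook Buchberger loop (add every nonzero reduced s-polynomial, with no removal of reducible basis elements), and the dictionary representation and all function names are assumptions introduced here. The input is taken to be $F = \{x^2 + x - 12,\; xy - 12,\; y^2 - 4\}$ with $x > y$ and total degree ordering. The s-polynomial below may differ in sign from the convention in the text, which does not affect whether it reduces to 0.

```python
from fractions import Fraction
from itertools import combinations

# A polynomial is a dict {exponent tuple: Fraction}; with variables (x, y)
# the tuple (2, 1) stands for the term x^2 y.

def degree_key(term):
    """Total degree ordering, ties broken lexicographically with x > y."""
    return (sum(term), term)

def head(p, key):
    return max(p, key=key)

def add_scaled(p, q, coeff, shift):
    """Return p + coeff * (term `shift`) * q, dropping zero coefficients."""
    result = dict(p)
    for term, c in q.items():
        moved = tuple(a + b for a, b in zip(term, shift))
        value = result.get(moved, Fraction(0)) + coeff * c
        if value:
            result[moved] = value
        else:
            result.pop(moved, None)
    return result

def s_polynomial(p, q, key):
    hp, hq = head(p, key), head(q, key)
    lcm = tuple(max(a, b) for a, b in zip(hp, hq))
    up = tuple(a - b for a, b in zip(lcm, hp))
    uq = tuple(a - b for a, b in zip(lcm, hq))
    left = add_scaled({}, p, 1 / p[hp], up)
    return add_scaled(left, q, -1 / q[hq], uq)

def normal_form(p, rules, key):
    """Rewrite p with the rules until no term of p is reducible."""
    p = dict(p)
    changed = True
    while changed and p:
        changed = False
        for term in sorted(p, key=key, reverse=True):
            for rule in rules:
                hr = head(rule, key)
                if all(a <= b for a, b in zip(hr, term)):
                    shift = tuple(a - b for a, b in zip(term, hr))
                    p = add_scaled(p, rule, -p[term] / rule[hr], shift)
                    changed = True
                    break
            if changed:
                break
    return p

def buchberger(polys, key):
    """Basic Buchberger loop: add each nonzero reduced s-polynomial."""
    basis = [dict(p) for p in polys]
    pairs = list(combinations(range(len(basis)), 2))
    while pairs:
        i, j = pairs.pop()
        remainder = normal_form(s_polynomial(basis[i], basis[j], key), basis, key)
        if remainder:
            new_index = len(basis)
            pairs.extend((k, new_index) for k in range(new_index))
            basis.append(remainder)
    return basis

def is_inconsistent(polys, key=degree_key):
    """True if the computed basis contains a nonzero constant."""
    return any(set(g) == {(0, 0)} for g in buchberger(polys, key))

F = [
    {(2, 0): Fraction(1), (1, 0): Fraction(1), (0, 0): Fraction(-12)},  # x^2 + x - 12
    {(1, 1): Fraction(1), (0, 0): Fraction(-12)},                       # xy - 12
    {(0, 2): Fraction(1), (0, 0): Fraction(-4)},                        # y^2 - 4
]
print(is_inconsistent(F))
```

A nonzero constant in the basis plays the role of the rule $1 \to 0$ above. The sketch skips the deletion of reducible rules, so its intermediate polynomials need not match the numbered rules in the example.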

Example  3: 

This third example illustrates a major drawback of this
approach: the following equations do not have a common real
zero, but they do have a common complex zero, thus their
Grobner basis does not contain 1.

Let $F = \{x^2 + 1\}$. This polynomial is its own Grobner basis,
and yet it does not have a real zero.

Example  4: 

This  example  illustrates  how  the  solution  to  a  set  of  equa¬ 
tions  can  be  found  using  lexicographic  order. 

Let $F = \{x^2 y - y^2,\; x y^2 - 3\}$, with $x > y$, as in example 1.
Turning these equations into rewrite rules using lexicographic
ordering yields:

1. $x^2 y \to y^2$

2. $x y^2 \to 3$

Rules 1 and 2 yield the following s-polynomial:

$y \cdot (y^2) - x \cdot (3) = y^3 - 3x$

Thus yielding rule:

3. $x \to \frac{1}{3} y^3$

Note that this rule is oriented differently than in example 1.
Rule 3 can then be used to simplify both rules 1 and 2. Rule 2
simplifies to

4. $y^5 \to 9$

and rule 1 simplifies to a multiple of the same polynomial,
which rule 4 then reduces to 0.

Thus the final Grobner basis consists of the polynomials:

1. $y^5 - 9$

2. $x - \frac{1}{3} y^3$

Note that the first polynomial is in terms of only $y$, while the
second is in terms of $x$ and $y$. Thus, solving the first equation
for $y$ and plugging in this value in the second equation
yields the solution:

$y = 3^{2/5}$ and $x = 3^{1/5}$

In  general,  running  the  Grobner  basis  computation  using 
lexicographic  ordering  generates  triangular  sets  of  equations 
like  the  one  in  this  example. 
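
The triangular form can be solved by back substitution and the result checked numerically. The sketch below is an illustration only; it assumes the basis $\{y^5 - 9,\; x - \frac{1}{3} y^3\}$ and the original equations $x^2 y - y^2 = 0$ and $x y^2 - 3 = 0$, and it takes the real fifth root in floating point.

```python
import math

# Solve the univariate polynomial y^5 - 9 = 0 for its real root,
# then substitute into x - (1/3) y^3 = 0.
y = 9 ** (1 / 5)
x = y ** 3 / 3

# Residuals of the two original equations of F at (x, y).
residual_1 = x ** 2 * y - y ** 2
residual_2 = x * y ** 2 - 3

print(math.isclose(x, 3 ** (1 / 5)))        # x agrees with 3^(1/5)
print(abs(residual_1) < 1e-9, abs(residual_2) < 1e-9)
```

Both residuals vanish up to rounding, so the pair obtained from the triangular basis is a common zero of the original system.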

We  now  give  examples  of  the  use  of  the  Grobner  basis  in 
solving  our  image  understanding  problem. 

5.  EXAMPLES 

In  this  section  we  illustrate  the  application  of  our  method 
on  a  very  simple  object.  We  defined  a  20  by  20  by  20  cube 
centered  about  the  origin  and  formed  two  orthographic  pro¬ 
jections  of  it  along  two  diagonals.  The  resulting  images  are 
shown  in  Figure  5.1.  The  first  image  corresponds  to  a  rota¬ 
tion  of  45  degrees  around  the  x-axis,  followed  by  a  rotation 
of  -35.26  degrees  around  the  y-axis,  followed  by  a  rotation 
of  -30  degrees  about  the  z-axis.  The  second  image 
corresponds  to  a  rotation  of  -45  degrees  around  the  x-axis, 
followed  by  a  rotation  of  -35.26  degrees  around  the  y-axis, 
followed  by  a  rotation  of  30  degrees  about  the  z-axis.  From 
these  projections  the  x  and  y  components  of  the  two  images 
were  calculated.  Since  these  are  real  numbers  and  the 
Grobner  basis  method  requires  exact  coordinates,  these 
numbers had to be represented by using polynomial equations.
For example, if the x component of a point is $200^{1/2}$,
then it would be input as $x^2 - 200 = 0$, instead of as
$x \approx 14.142136$. Note that the Grobner basis does not allow
us to distinguish between $+200^{1/2}$ and $-200^{1/2}$. However, if
we have two points $x_1$ and $x_2$, one whose value is $+200^{1/2}$
and the other whose value is $-200^{1/2}$, then these can be
represented by the two equations, $x_1^2 - 200 = 0$ and
$x_1 + x_2 = 0$.

We  performed  several  experiments  testing  algebraic  con¬ 
sistency  on  this  cube.  These  experiments  were  performed 
using  an  implementation  of  the  Grobner  basis  algorithm  run¬ 
ning  on  the  Symbolics  lisp  machine.  These  experiments  are 
described  below: 

Experiment  1:  Determining  depth  given  the  transformation 
between  the  two  views. 

In  this  experiment  we  took  the  face  corresponding  to  ver¬ 
tices  1,  2,  3,  and  4  in  Figure  5.1  as  our  vertex  group.  The 
set  of  equations  corresponding  to  this  problem  consists  of  the 
union  of  the  following  equations: 

1 .  The  trig  identity. 

This set consists of the equations of the form
$\cos^2\psi + \sin^2\psi = 1$, one for each of the six rotation
angles (three per image). Note that the angles themselves
are not actually the indeterminates in the problem. The
actual indeterminates are the sine and cosine of the
angles. Thus for angle $\psi$ there would correspond two
variables, cos_psi and sin_psi.

2.  The  specification  of  the  transformation  between  the  two 
images. 

To simplify the computation we used the coordinate
system of the first image as the object coordinate system.
This meant that the three angles, $\psi_1$, $\theta_1$, and $\phi_1$,
corresponding to rotation about the x, y, and z axis,
respectively, are all 0. This is specified by the equations:
cos_psi_1 - 1 = 0, and sin_psi_1 = 0, etc.

The transformation from image 1 to image 2 is given
by $\psi_2 = -70.53$ degrees, $\theta_2 = 0$, and $\phi_2 = 0$. Since
$\cos(-70.53^\circ) = 1/3$ and $\sin(-70.53^\circ) = -(8/9)^{1/2}$, the
equations for $\psi_2$ are 3 cos_psi_2 - 1 = 0, and
9 sin_psi_2^2 - 8 = 0.

Note  that  this  set  of  equations  makes  the  trig  identity 
equations  redundant. 

3.  The  specification  of  the  image  points  in  terms  of  the  ob¬ 
ject  points. 

As indicated earlier, the general equation for the
points in the second image in terms of the first image is

$p_2 = w\,[I_2]\,[R_z]\,[R_y]\,[R_x]\,[p_1] + p_0$

In this example $w$ is 1, and $p_0$ is 0. Thus only the
rotation matrix is left. Instead of multiplying the three
rotation matrices out we simply added intermediate
variables to keep track of the values after the rotations
about each axis. For example, the equations describing
the rotation of the object point $x_0$ into the image point
$x$ would be

$x_1 - x_0 = 0$

$x_2 - (x_1 \cos\theta + z_1 \sin\theta) = 0$

$x - (x_2 \cos\phi - y_2 \sin\phi) = 0$


In  experiments  it  turned  out  that  factoring  the  equations 
this  way  sped  up  the  Grobner  basis  computation.  We 
have  similar  equations  for  the  coordinates  of  each  point 
in  the  image. 

4.  The  coplanarity  equations. 

For each point $(x, y, z)$ in a vertex group we would
add the equation $ax + by + cz + d = 0$. Since in our
example none of the cube's faces passes through the origin,
we know that $d$ is not 0. This lets us divide the
equation by $d$, giving us a simpler equation,
$ax + by + cz + 1 = 0$. We have an equation such as this
for each point in the vertex group.

5.  The  vertex  assignments. 

In  order  to  equate  point  p  in  image  1  with  object 
coordinates  x,  y,  and  z,  with  point  p'  in  image  2  with  ob¬ 
ject  coordinates  x',  y',  and  z',  we  would  add  the  three 
equations: 

$x' - x = 0$,

$y' - y = 0$, and

$z' - z = 0$.

In  this  example  we  assigned  vertex  1  in  image  1  to 
vertex  1  in  image  2,  vertex  2  in  image  1  to  vertex  2  in 
image  2,  and  so  on. 

The  union  of  all  these  equations  made  up  the  input  to  the 
Grobner  basis.  In  all  there  were  126  equations.  (Some  of 
these  are  shown  in  Figure  5.2.)  However,  many  of  them  are 
redundant.  After  272  seconds  and  43  critical  pairs  the 
Grobner  basis  was  computed.  It  consisted  of  119  polynomi¬ 
als,  including  the  solution  to  the  vertex  depths. 
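
The polynomial encoding of the rotation angle used in this experiment can be sanity-checked numerically. The sketch below is illustrative only: the variable names mirror the indeterminates cos_psi_2 and sin_psi_2, and the tolerance of 1e-3 is an assumption chosen because the angle is quoted to two decimal places.

```python
import math
from fractions import Fraction

psi_2 = math.radians(-70.53)
cos_psi_2 = math.cos(psi_2)
sin_psi_2 = math.sin(psi_2)

# Residuals of the two polynomial equations that stand in for the angle.
residual_cos = 3 * cos_psi_2 - 1
residual_sin = 9 * sin_psi_2 ** 2 - 8
print(abs(residual_cos) < 1e-3, abs(residual_sin) < 1e-3)

# With the exact values cos = 1/3 and sin^2 = 8/9 the trig identity
# cos^2 + sin^2 = 1 holds automatically, so it adds no new constraint.
exact_cos = Fraction(1, 3)
exact_sin_squared = Fraction(8, 9)
print(exact_cos ** 2 + exact_sin_squared == 1)

# The squared equation cannot distinguish the sign of the sine.
print(sin_psi_2 < 0)
```

The last line shows that the actual sine is negative, information that the equation in sin_psi_2 squared alone does not carry.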

Experiment  2:  Determining  inconsistency  given  the  transfor¬ 
mation  between  the  two  views. 

This  problem  is  very  similar  to  the  first  problem.  The 
only  difference  is  in  the  assignment  set.  Instead  of  the  con¬ 
sistent  assignment  given  above,  we  gave  it  the  inconsistent 
assignment: 

$v_{1,1} = v_{2,3}$
$v_{1,2} = v_{2,1}$
$v_{1,3} = v_{2,4}$
$v_{1,4} = v_{2,2}$

where $v_{i,j}$ is vertex $j$ in image $i$. After 48 seconds and 0 critical
pairs the Grobner basis detected inconsistency. Note that
detecting  inconsistency  is  much  easier  than  determining  con¬ 
sistency.  This  is  because  contradiction  might  be  detected 
well  before  all  possible  pairs  are  examined. 

Experiment  3:  Computing  the  Grobner  basis  without  giving 
the  transformation  between  views. 

This  problem  is  severely  underconstrained.  In  fact,  with 
only  one  face  of  the  cube,  any  assignment  is  consistent. 
Given  the  same  assignment  as  in  experiment  1,  the  Grobner 
basis  computation  ran  for  1  hour,  generating  1824  critical 
pairs.  The  depths  and  rotation  parameters  were  still  only  par¬ 
tially  constrained.  Given  the  assignment  in  experiment  2, 
the  lisp  machine  ran  out  of  space  before  finding  a  Grobner 
basis. 

Experiment 4: Extending a consistent assignment with
another vertex group.

In  this  problem  we  took  the  Grobner  basis  generated  in 
the  previous  experiment  and  added  to  it  the  equations 
corresponding  to  the  face  containing  vertices  3,  4,  5,  6  (see 
Figure  5.1).  With  a  consistent  assignment  the  Grobner  basis 
terminated  after  10  hours,  22  minutes  and  4432  critical  pairs. 
With  an  inconsistent  assignment  the  Grobner  basis  detected 
inconsistency  after  4  hours,  15  minutes  and  2055  critical 
pairs. 

Further  Experiments: 

We  performed  further  experiments  on  the  cube  by  modi¬ 
fying  some  of  the  experiments  above  to  include  a  scaling  fac¬ 
tor,  and  a  translation.  As  expected  this  resulted  in  longer 
running  times  for  the  Grobner  basis  computation,  but  these 
problems  still  finished. 

While  the  above  experiments  have  been  somewhat  suc¬ 
cessful,  they  indicate  a  need  to  make  progress  on  several 
fronts. 

6.  EXTENSIONS 

The  extensive  computation  required  for  the  examples  in 
the  previous  section  clearly  indicates  a  need  to  reduce  the 
size  of  the  Grobner  basis.  This  is  crucial  when  we  want  to 
augment  a  completed  Grobner  basis  with  additional  vertex 
groups,  or  when  we  want  to  merge  two  Grobner  bases  to¬ 
gether. 

In these cases we would like to be able to extract from
the Grobner basis only the information that is likely to cause
inconsistency. For example, in experiment 1 above, many of
the 119 polynomials in the final basis involve only the
intermediate variables used to specify the image points in terms
of the object points. Once the basis has been computed,
these intermediate variables are no longer of any use. The
information that will be useful are the equations involving
the affine transformation parameters, and the unknown
vertex depths.

This  information  can  be  isolated  by  using  the  lexicograph¬ 
ic  ordering  instead  of  the  degree  ordering.  However,  run¬ 
ning  the  Grobner  basis  with  lexicographic  ordering  can  be 
considerably  more  expensive  than  using  degree  ordering.  An 
alternative  is  to  first  generate  a  Grobner  basis  using  degree 
ordering,  and  then  run  the  Grobner  basis  computation  on 
the  resulting  set  of  equations  just  long  enough  to  generate 
the  equations  constraining  the  indeterminates  that  we  are  in¬ 
terested  in. 

Another  line  of  attack  would  be  to  classify  a  type  associat¬ 
ed  with  small  groups  of  equations.  The  type  class  would  be 
determined  by  geometric  and  topological  relations  that  define 
the  group.  If  such  a  type  function  can  be  defined,  then  it 
would  be  possible  to  avoid  generating  elements  of  the 
Grobner  basis  that  can  never  lead  to  meaningful  geometric 
deductions.  That  is,  one  would  not  attempt  to  generate  criti¬ 
cal  pairs  between  rules  of  dissimilar  type.  One  example  of 
such  a  type  class  has  already  been  discussed,  the  coplanar  set; 
if  we  had  another  group,  say  three  mutually  perpendicular 
edges,  it  would  not  seem  meaningful  to  generate  critical  pairs 
across  these  two  equation  sets. 

Another  major  area  of  extension  is  the  control  mecha¬ 
nism  for  generating  hypotheses  and  maintaining  consistent 
assignment  sets.  One  aspect  is  similar  to  truth  maintenance 


systems,  or  TMS,  where  one  maintains  a  network  of  depen¬ 
dencies  between  assumptions  and  conclusions.  In  our  case 
the  assumptions  concern  assignments  between  elements  of 
the  projections  in  the  two  views  as  well  as  assumptions  that 
may  be  made  about  the  object  or  the  viewing  transformation. 

The  other  aspect  is  to  use  a  hierarchy  to  define  the  con¬ 
trol  strategy.  Assumptions  about  the  object  and  viewing 
transformation  may  take  the  form  of  a  hierarchy  in  which 
more  specific  assumptions  are  tried  and  relaxed  until  one 
finds  the  least  general  assumption  that  leads  to  consistency. 
For example, in the case of the viewing transformation,
perspective is more general than affine, which is more general
than orthographic. Correspondences between views that are
inconsistent under orthography may be consistent under
perspective. The same type of hierarchy can apply to
assumptions about the object; for example, a polyhedron is more
general than a rectilinear solid, which is more general than a
cube.

REFERENCES 

[1] Arnon, D.S., Collins, G.E., and McCallum, S., “Cylindrical
Algebraic Decomposition I: The Basic Algorithm,”
SIAM J. of Computing 13, pp. 865-877, 1984.

[2] Arnon, D.S., Collins, G.E., and McCallum, S., “Cylindrical
Algebraic Decomposition II: An Adjacency
Algorithm for the Plane,” SIAM J. of Computing 13,
pp. 878-889, 1984.

[3] Besl, P. and Jain, R., “Three-Dimensional Object
Recognition,” Computing Surveys 17 (1), March 1985.

[4] Brooks, R.A., “Symbolic Reasoning Among 3D
Models and 2D Images,” Artificial Intelligence 17,
p. 285, 1981.

[5] Buchberger, B., “An Algorithm for Finding a Basis for
the Residue Class Ring of a Zero-dimensional Polynomial
Ideal,” (in German) Ph.D. Thesis, Univ. of
Innsbruck, Austria, Math. Inst., 1965.

[6] Buchberger, B., “Grobner Bases: An Algorithmic
Method in Polynomial Ideal Theory,” in: Multidimensional
Systems Theory, N.K. Bose (ed.), pp. 184-232,
Reidel, 1985.

[7] Clowes, M.B., “On Seeing Things,” Artificial Intelligence
2 (1), pp. 79-116, 1971.

[8] Connolly, C.I. and Stenstrom, J.R., “Construction of
Polyhedral Models from Multiple Range Views,” Proc.
8th International Conference on Pattern Recognition,
1986.

[9] Crapo, H., “Spatial Realizations of Linear Scenes,” in
Structural Topology, no. 13, 1986.

[10] Faugeras, O., Ayache, N., Faverjon, B., Lustman, F.,
“Building Visual Maps by Combining Noisy Stereo
Measurements,” Proc. IEEE Conf. on Robotics and
Automation, p. 1433, 1986.

[11] Faugeras, O. and Hebert, M., “A 3D Recognition and
Positioning Algorithm Using Geometrical Matching
Between Primitive Surfaces,” Proc. 7th International
Joint Conference on AI, 1983.

[12] Grimson, E. and Lozano-Perez, T., “Search and Sensing
Strategies for Recognition and Localization of
Two- and Three-Dimensional Objects,” Proc. 3rd
International Symposium on Robotics Research, 1985.

[13] Henle, M., A Combinatorial Introduction to Topology,
Freeman and Co., San Francisco, 1979.

[14] Herman, M., Representation and Incremental Construction
of a Three-Dimensional Scene Model, Carnegie-Mellon
University Report, CMU-CS-85-103, 1985.

[15] Huffman, D.A., “Impossible Objects as Nonsense
Sentences,” Machine Intelligence 6, pp. 295-323, 1971.

[16] Kapur, D., “Geometry Theorem Proving Using
Hilbert's Nullstellensatz,” NSF Workshop on
Geometric Reasoning, June 1986.

[17] Lowe, D., Perceptual Organization and Visual Recognition,
Kluwer Academic Publishers, Boston, Mass.,
1985.

[18] Malik, J., “Interpreting Line Drawings of Curved Objects,”
PhD Thesis, Stanford University, 1985.

[19] Markowsky, G. and Wesley, M.A., “Fleshing Out Wire
Frames,” IBM Journal of Research and Development 24
(5), 1980.

[20] Mayhew, J.E., et al., Keynote address, IEEE Conf. on
Computer Vision and Pattern Recognition, June 1986.

[21] Nguyen, V., “A Parallel Algorithm for Labelling Polyhedral
Images,” submitted to IJCAI-87, 1987.

[22] Roberts, L.G., “Machine Perception of Three-Dimensional
Solids,” Optical and Electro-Optical Information
Processing, J.T. Tippett et al. (eds.), MIT Press,
Cambridge, Mass., 1965.

[23] Strat, T., “Spatial Reasoning From Line Drawings of
Polyhedra,” Proc. IEEE 1984 Workshop on Computer Vision
Representation and Control.

[24] Sugihara, K., “An Algebraic Approach to Shape From
Image Problems,” Artificial Intelligence 23, p. 59, 1984.

[25] Thompson, D. and Mundy, J., “Three-Dimensional
Model Matching From an Unconstrained Viewpoint,”
Proc. IEEE Conf. on Robotics and Automation, April
1987.

[26] van der Waerden, B.L., Modern Algebra, Vol. I and II,
Frederick Ungar Publishing Co., New York, 1966.

[27] Waltz, D.L., “Generating Semantic Descriptions from
Drawings of Scenes with Shadows,” The Psychology of
Computer Vision, P.H. Winston (ed.), McGraw Hill,
New York, 1972.




Figure  5.2.  Part  of  a  Grobner  basis  input. 




SPECTRAL  AND  POLARIZATION  STEREO  METHODS 
USING  A  SINGLE  LIGHT  SOURCE 

Lawrence  Brill  Wolff 


Department  of  Computer  Science 
Columbia  University 
New  York,  NY  10027 


ABSTRACT 


Proposed  herein  are  novel  stereo  techniques  for  the 
determination  of  surface  orientation  from  a  single  view  and  a  single 
light  source  exploiting  the  physical  wave  properties  of  light  and  the 
basic  physics  of  material  surfaces.  Conventional  photometric  stereo 
techniques  consider  the  intersection  of  equi-reflectance  curves  that 
arise  in  gradient  space  while  varying  the  imaging  geometry  with 
respect  to  the  spatial  position  of  a  light  source.  Spectral  and 
polarization  stereo  methods  consider  the  intersection  of  equi- 
reflectance  curves  in  gradient  space  while  varying  the  wavelength 
(i.e. color) and/or the polarization of light emanating from a single
light  source  while  leaving  the  imaging  geometry  invariant.  The 
reflectance  function  used  here  results  from  an  extended  form  of  the 
Torrance-Sparrow  model  which  is  dependent  upon  the  wavelength 
and  polarization  of  incident  light  as  well  as  various  physical 
parameters  of  the  material  surface.  This  reflectance  model  is  more 
accurate  than  commonly  used  Lambertian  reflectance  models  and  is 
applicable  to  a  wide  variety  of  isotropically  rough  surfaces  ranging 
from  metals  to  paper.  Computer  simulations  of  the  intersections  of 
equi-reflectance  curves  that  arise  by  varying  incident  wavelength  and 
polarization are shown for two different types of materials: a
dielectric, Magnesium oxide (used in white paint), and a conductor,
Aluminum. Two different methods called direct and indirect
resolution  of  specular  and  diffuse  reflection  components  are  used  to 
minimize  measurement  error  and  to  resolve  intersections, 
respectively. 

An interesting problem arises for stereo methods that vary only the
wavelength and/or polarization of the incident light source. Surface
orientations that do not lie on the source-viewing axis (the line in
gradient space determined by the origin and the point representing the
incident orientation of the single light source) cannot be resolved
any better than up to a two point ambiguity. This results from an
inherent "flip" symmetry induced on each equi-reflectance curve in
gradient space by the isotropic reflectance function being used. It is
demonstrated that it is feasible to break this inherent symmetry by
performing spectral/polarization stereo methods using a single
extended light source with a variable "asymmetric" aperture. This
makes it possible to obtain a unique measurement for surface
orientations not on the source-viewing axis using only a single light
source.


1.  INTRODUCTION 

Previous stereo methods such as the traditional stereo method
of depth determination by parallax and the photometric stereo
method for the determination of local surface orientation ([Woodham
1980]) entail taking successive images while varying some aspect of
the imaging geometry. Recently it was reported in [Wolff 1986] that
stereo  vision  can  be  generally  formalized  far  beyond  previously  used 
methods  by  taking  successive  images  while  varying  physical 
parameters  of  an  imaging  system  other  than  just  the  imaging 
geometry.  Spectral  and  polarization  stereo  methods  used  to 
determine  surface  orientation  are  a  particular  example  of  a  class  of 
stereo  methods  predicted  by  this  general  stereo  model. 

The  stereo  methods  presented  here  involve  using  the  familiar 
imaging  setup  as  depicted  in  figure  1  with  a  single  light  source 
incident  on  a  material  surface.  Standard  gradient  space  will  be  used 
throughout  to  represent  surface  orientation.  Spectral  stereo  entails 
varying  the  incident  wavelength  alone  between  successive  images. 
Polarization  stereo  entails  varying  the  incident  polarization  alone 
between successive images. Spectral and polarization stereo can be
combined  by  varying  both  the  wavelength  and  polarization 
simultaneously  between  successive  images.  Spectral  stereo  can  be 
implemented  by  using  various  narrow  pass  monochrome  color  filters 
and  polarization  stereo  can  be  implemented  using  various  Polaroid 
filters.  All  other  physical  parameters  governing  the  imaging  system 
are  to  remain  unchanged.  As  in  conventional  photometric  stereo, 
local  surface  orientation  is  determined  by  measuring  the  image 
irradiance  value  at  the  pixel  corresponding  to  the  projected  object 
point  giving  rise  to  an  equi-reflectance  curve  in  gradient  space.  The 
projection  function  of  an  object  onto  the  image  plane  is  assumed  to 
be  orthographic. 

Because  of  its  simplicity,  the  Lambertian  reflectance  function 
is used in numerous applications in low level vision. However, the
reflection of light off a material surface is a highly complex process
dependent on the imaging geometry, the wave characteristics of the
incident light radiation, the micro-geometry of the material surface
(e.g. roughness) and the internal physics (e.g. electric resistivity) of
the surface material itself. These aspects of the reflection process
beyond  the  imaging  geometry  account  for  why  the  reflective 
properties  of  most  material  surfaces  deviate  from  the  Lambertian 
model. 


The  motivation  for  using  the  extended  Torrance-Sparrow 
reflection  model  (ETSRM)  is  twofold.  First,  as  just  mentioned,  the 
Lambertian  reflectance  model  is  too  simplistic  and  is  only  applicable 
to  a  very  narrow  class  of  materials  whereas  the  ETSRM  is  far  more 
general. Second, the Lambertian model is not dependent at all
on  the  wavelength  or  polarization  of  incident  light  which  would 
make  the  stereo  methods  considered  here  infeasible.  While  the 
ETSRM  may  be  relatively  new  to  the  computer  vision  community,  it 
is  popularly  used  in  computer  graphics  for  rendering  realistic  images 
of  smooth  material  surfaces  [Cook,  Torrance  1981]. 




To  demonstrate  the  wide  applicability  of  the  ETSRM,  and 
therefore  to  spectral/polarization  stereo  methods,  simulations  of  the 
measurement  of  surface  orientation  will  be  performed  on  two  very 
different  types  of  materials.  The  first  is  Magnesium  oxide  (MgO) 
classified  as  a  dielectric  which  is  a  material  that  does  not  conduct 
electricity.  MgO  has  a  strong  diffuse  component  of  reflection.  The 
second  material  is  Aluminum  which  is  a  conductor  and  has  a  strong 
specular  component  of  reflection.  As  will  be  seen,  the  distinction 
between  whether  a  material  is  a  dielectric  or  a  conductor  is  important 
for  selecting  the  appropriate  spectral/polarization  stereo  technique. 

The  reflectance  function  resulting  from  the  ETSRM  is 
isotropic  in  the  sense  that  the  reflected  radiance  from  a  point  on  a 
material  surface  is  independent  of  any  rotation  about  the  normal  to 
this point. This induces a "flip" symmetry on the resulting
equi-reflectance curves in gradient space. A formal description of this is
given  in  section  7.  Because  the  imaging  geometry  is  left  invariant 
for  spectral/polarization  stereo  methods  the  equi-reflectance  curves 
produced  from  successive  images  possess  the  same  exact  "flip" 
symmetry.  This  makes  it  theoretically  impossible  to  measure  most 
surface  orientations  uniquely  beyond  a  two  point  ambiguity  using 
spectral/polarization  stereo  methods  with  a  single  point  light  source. 
A  technique  is  presented  to  break  this  "flip"  symmetry  by  using  a 
single  extended  light  source  subtending  a  solid  angular  area  which 
does  not  possess  the  same  "flip"  symmetry. 

Unless  stated  otherwise,  all  light  sources  mentioned  are 
assumed  to  be  point  light  sources.  For  brevity,  the  term 
source-viewing  plane  in  this  paper  refers  to  the  plane  determined  by 
the incident source orientation vector (represented by $(P_s, Q_s, -1)$)
and the viewing orientation vector (assumed to be $(0, 0, -1)$). The
source-viewing axis is the line in gradient space determined by the
origin and the point $(P_s, Q_s)$.


2. THE EXTENDED TORRANCE-SPARROW REFLECTION
MODEL

The reflectance model reported in [Torrance, Sparrow 1967]
assumes that every surface has a microscopic level of detail which
consists of a statistically large number of perfectly smooth planar
microfacets. These microfacets are oriented according to a particular
probability distribution function of the angle between the normal to
each microfacet and the normal to the surface. This implies that the
surface is isotropically rough about the normal to the surface, as the
probability distribution function is not dependent on the azimuthal
orientation of the normal to a microfacet.

The Torrance-Sparrow reflectance model is primarily based on
geometric optics. Light rays are assumed to reflect off a material
surface with both a specular and a diffuse component. The specular
component of reflection accounts for when a light ray specularly
reflects off a microfacet, and the diffuse component is described by
the Lambertian reflectance function, which accounts for multiple
specular reflections and/or reflections of light rays that penetrate into
the skin of the surface and then reflect back out. According to the
Torrance-Sparrow reflection model the form of the reflected radiance
function is given by:

(1)  \[ dN_r \equiv g N_i R_s \, d\omega_i + N_i R_d \, d\omega_i . \]

The terms $R_s$ and $R_d$ are the functions for the specular and diffuse
components of reflection respectively. The term $N_i$ represents the
incident radiance of the light source through the infinitesimal solid
angle $d\omega_i$ (see figure 2) and $g$ is the proportion of the specular
reflectance relative to the diffuse reflectance. The equivalence
symbol in equation (1) and all other equations in this paper
represents proportionality. The specular and diffuse component
reflection functions are given by:

\[ R_s \equiv \frac{P(\alpha)\, G(\Psi,\theta,\phi)\, F(\Psi',\eta)}{\cos\theta} \]

\[ R_d \equiv \cos\Psi \]

where the angular arguments correspond to figure 2. Note that $R_d$ is
the Lambertian reflectance function. Because of its central role in
spectral/polarization stereo applications, section 3 is entirely devoted
to the function $F(\Psi',\eta)$.

The function $P(\alpha)$ is the probability distribution function for
the orientations of the normals to the planar microfacets. The
probability distribution function proposed in [Torrance, Sparrow
1967] is:

\[ P(\alpha) \equiv \exp\left( -(\alpha c)^2 \right) . \]

The variable $\alpha$ is the angle between the normal to a given microfacet
and the normal to the surface and is depicted in figure 2. The term $c$
is an ad hoc constant which is determined empirically by solving
equation (1) using observed reflected radiance values.

It is suggested in [Cook, Torrance 1981] that a probability
distribution function for the orientation of microfacet normals using
physical parameters (i.e. no ad hoc constants) could be used to
emulate the Beckmann distribution for mean scattered power of light
reflected from a rough surface. This function is derived in
[Beckmann, Spizzichino 1963] and is given by:

\[ P(\alpha) \equiv \frac{1}{m^2 \cos^4\alpha} \exp\left( -\frac{\tan^2\alpha}{m^2} \right) . \]

The term $m$ is the root mean square slope value of the microfacet
surface normals. Large values of $m$ indicate a "rough" surface while
small values of $m$ indicate "smooth" surfaces. The extended
Torrance-Sparrow reflection model refers to incorporating the
Beckmann distribution function into the expression for the specular
component of reflection $R_s$. The value for $m$ can be determined by
solving equation (1) using reflected radiance values, or can be
measured more directly using a stylus profilometer.
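The two microfacet distribution functions of this section can be compared numerically. The following is a minimal sketch, assuming the unnormalized forms $\exp(-(\alpha c)^2)$ and $\exp(-\tan^2\alpha/m^2)/(m^2\cos^4\alpha)$; the function names and the sample values of $m$ are illustrative choices and do not come from the paper.

```python
import math

def torrance_sparrow_distribution(alpha, c):
    """Gaussian facet distribution exp(-(alpha*c)^2), up to a constant factor.

    alpha is the facet-normal angle in radians, c the empirical constant.
    """
    return math.exp(-(alpha * c) ** 2)

def beckmann_distribution(alpha, m):
    """Beckmann distribution exp(-tan^2(alpha)/m^2) / (m^2 cos^4(alpha)).

    alpha is the facet-normal angle in radians, m the rms facet slope.
    """
    cos_a = math.cos(alpha)
    tan_a = math.tan(alpha)
    return math.exp(-(tan_a / m) ** 2) / (m ** 2 * cos_a ** 4)

if __name__ == "__main__":
    # Hypothetical rms slopes: a smoother surface (0.2) and a rougher one (0.6).
    for m in (0.2, 0.6):
        values = [beckmann_distribution(math.radians(d), m) for d in (0, 10, 20, 40)]
        print(m, [round(v, 4) for v in values])
```

Smaller values of $m$ concentrate the distribution near $\alpha = 0$, which corresponds to the narrow specular lobe of a smooth surface.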




Figure 2


The reflection model presented in [Torrance, Sparrow 1967]
also takes into account the mutual shadowing and masking of surface
microfacets against one another. It is assumed that each specularly
reflecting microfacet comprises one side of a symmetric V groove as
depicted in figure 3. The process of shadowing and masking
attenuates the total fraction of reflecting surface area. The
proportional amount of this attenuation is called the geometric
attenuation factor and is given by:

\[ G(\Psi,\theta,\phi) = \min\left\{ 1,\ \frac{2\cos\alpha\cos\theta}{\cos\Psi'},\ \frac{2\cos\alpha\cos\Psi}{\cos\Psi'} \right\} . \]

The above expressions for the geometric attenuation factor and
the Beckmann distribution function conform to the geometry
depicted in figure 2. With respect to the angles $\Psi$, $\theta$ and $\phi$ depicted
in figure 2, $\alpha$ and $\Psi'$ can be expressed as:

\[ \Psi' = \tfrac{1}{2}\cos^{-1}\left[ \cos\theta\cos\Psi - \sin\theta\sin\Psi\cos\phi \right] \]

\[ \alpha = \cos^{-1}\left[ \cos\Psi\cos\Psi' + \sin\Psi\sin\Psi'\cos\beta \right] \]

where

\[ \beta = \sin^{-1}\left[ \sin\phi\sin\theta / \sin 2\Psi' \right] . \]


Figure 3


3. THE FRESNEL REFLECTION COEFFICIENT

The function $F(\Psi',\eta)$ incorporated into the specular
component of reflection is of primary importance to spectral and
polarization stereo. It is known as the Fresnel reflection coefficient.
This is the only part of expression (1) that is dependent on the
wavelength and polarization of incident light. The derivation of
$F(\Psi',\eta)$ treats light as an electromagnetic wave. The polarization of
a light wave is defined in terms of how the electric field vector is
oriented with respect to the plane of incidence, the plane determined
by the incident and specularly reflected vectors. An incident light
wave is parallel polarized if its electric field vector is parallel to the
plane of incidence and is perpendicularly polarized if its electric field
vector is perpendicular to the plane of incidence.

An incident light wave that is specularly reflected from a
perfectly smooth surface results in a partially transmitted wave and a
partially reflected wave. The Fresnel reflection coefficient
represents the proportional attenuation of the incident light energy
per unit time per unit area for specular reflection off of a perfectly
smooth surface. This simulates the specularly reflected radiance of
light rays off of the perfectly smooth microfacets.

The angle $\Psi'$ is the specular angle of incidence of a light ray
upon microfacets with surface normals oriented parallel to the
highlight vector depicted in figure 2. The specular angle of
incidence on the microfacets of a surface is equal to 1/2 of the phase
angle and is not dependent on the orientation of the surface. The
number of microfacets that the incident light is specularly reflected
from is determined by the surface orientation and the probability
distribution function $P(\alpha)$.

The term $\eta = n - iK$ is the complex index of refraction for the
surface material. The real terms $n$ and $K$ are dependent upon the
wavelength of incident light $\lambda_0$ according to the formulas:

\[ n^2 = \frac{\mu\gamma c_0^2}{2}\left\{ 1 + \left[ 1 + \left( \frac{\lambda_0}{2\pi c_0 r_e \gamma} \right)^2 \right]^{1/2} \right\} \]

\[ K^2 = \frac{\mu\gamma c_0^2}{2}\left\{ -1 + \left[ 1 + \left( \frac{\lambda_0}{2\pi c_0 r_e \gamma} \right)^2 \right]^{1/2} \right\} \]

The term $c_0$ is the speed of light in a vacuum, $r_e$ is the electrical
resistivity of the surface material, and $\gamma$ and $\mu$ are the electrical
permittivity and the magnetic permeability of the surface material
respectively. It turns out that $r_e$, $\gamma$ and $\mu$ are dependent upon the
surface temperature and the external presence of electric and
magnetic fields. This suggests other stereo methods to determine
surface orientation. For instance, vary the temperature between each
successive image, or for a conducting material vary the strength of
an electrical current passing through it between successive images.
Varying the strength of magnetic fields can also be used.

The term $K$ is called the coefficient of extinction. The
coefficient of extinction is zero for dielectrics (because $r_e$ is infinite)
and is non-zero for conductors since they have finite electrical
resistivity $r_e$. For dielectrics, the term $n$ is the simple index of
refraction.




A detailed derivation of the Fresnel reflection coefficient is
given in [Siegel, Howell 1981]. It is represented as the linear
superposition of the Fresnel reflection coefficients for perpendicular
and parallel polarized incident light waves respectively. That is:

\[ F(\Psi',\eta) = s\,F_{perp}(\Psi',\eta) + t\,F_{par}(\Psi',\eta), \qquad s,t \ge 0, \quad s + t = 1 . \]

In turn

\[ F_{perp}(\Psi',\eta) = \frac{a^2 + b^2 - 2a\cos\Psi' + \cos^2\Psi'}{a^2 + b^2 + 2a\cos\Psi' + \cos^2\Psi'} \]

\[ F_{par}(\Psi',\eta) = \frac{a^2 + b^2 - 2a\sin\Psi'\tan\Psi' + \sin^2\Psi'\tan^2\Psi'}{a^2 + b^2 + 2a\sin\Psi'\tan\Psi' + \sin^2\Psi'\tan^2\Psi'}\, F_{perp}(\Psi',\eta) \]

where

\[ 2a^2 = \left[ (n^2 - K^2 - \sin^2\Psi')^2 + 4n^2K^2 \right]^{1/2} + (n^2 - K^2 - \sin^2\Psi') \]

\[ 2b^2 = \left[ (n^2 - K^2 - \sin^2\Psi')^2 + 4n^2K^2 \right]^{1/2} - (n^2 - K^2 - \sin^2\Psi') \]

Unpolarized light is the equi-superposition of parallel and
perpendicular states. Therefore the Fresnel reflection coefficient for
unpolarized light uses $s = t = 1/2$.

Simulations of the determination of local surface orientation
will be performed using the reflectance maps predicted by the
ETSRM for MgO and Aluminum. The accuracy of the original
Torrance-Sparrow reflection model was tested on these two materials
in [Torrance, Sparrow 1967] with good results. The empirical values
for $g$, the relative proportion of specular to diffuse component
reflected radiance, were also obtained for MgO and Aluminum in
that paper ($g_{MgO} = 2$, $g_{Al} = 7/8$) and will be used in the simulations in
later sections. The Fresnel reflection coefficients for MgO and
Aluminum are graphed in figures 4 (a) through (d) against the angle of
incidence for different incident light polarizations and wavelengths.
Be aware of the scale on the vertical axis. The values for the indices
of refraction were obtained from [Physics Handbook] for different
wavelengths.

Typical human vision perceives light waves with a wavelength
between 400 and 700 nanometers (nm). "Vision" can theoretically
exist for any wavelength of electromagnetic radiation. The
wavelengths shown in figures 4 (a) and (b) for MgO are not in the
visible part of the spectrum. These wavelengths were chosen to fall
as close to the visible part of the spectrum as possible while being
able to obtain the indices of refraction from the tables in [Physics
Handbook]. Interpolation of data was not used, in order to make the
simulation of equi-reflectance curves as realistic as possible.

Figure 4 (a)-(d). Fresnel reflection coefficient plotted against the
angle of incidence in radians, for Magnesium oxide in (a) and (b) and
for Aluminum in (c) and (d).

An  interesting  feature  of  the  dielectric  MgO  is  that  it  has  a 
Brewster angle at $\arctan(n = 1.77) = 60.5°$ for which no parallel
polarized  light  is  reflected.  All  dielectrics  have  a  Brewster  angle  at 
the  arctangent  of  their  simple  index  of  refraction.  For  the 
determination  of  surface  orientation  by  varying  polarization  this  is  a 
very  important  angle  to  know  in  a  controlled  environment.  Accuracy 
of  measurement  can  be  substantially  enhanced  by  making  the  phase 
angle  approximately  equal  to  twice  the  Brewster  angle  for  a 
dielectric  material.  This  will  be  seen  in  section  5. 
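The Brewster angle quoted above follows from a one-line computation. The sketch below assumes only that the Brewster angle of a dielectric is the arctangent of its simple index of refraction, as stated in the text; the function name is illustrative.

```python
import math

def brewster_angle_deg(n):
    """Brewster angle in degrees for a dielectric with simple index of refraction n."""
    return math.degrees(math.atan(n))

if __name__ == "__main__":
    theta_b = brewster_angle_deg(1.77)   # MgO index quoted in the text
    print("Brewster angle (deg):", round(theta_b, 2))
    # Phase angle that places the specular angle of incidence at the Brewster angle.
    print("Phase angle (deg):", round(2.0 * theta_b, 2))
```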


As  can  be  seen  the  simple  index  of  refraction  for  MgO  has  a 
rather  weak  dependence  on  wavelength.  This  suggests  that  the 
combined  specular  and  diffuse  reflectance  will  be  altered  only 
slightly  for  different  incident  wavelengths  of  light.  In  section  6  a 
technique will be presented to obtain a discernible intersection from
specular  and  diffuse  component  equi-reflectance  curves. 
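With the terms of the ETSRM defined, the specular geometry of section 2 can be sketched numerically. The code below assumes the angle convention in which $\phi = 0$ is the forward (specular) direction. As a substitution made here for numerical robustness, it computes $\alpha$ from the half-vector identity $\cos\alpha = (\cos\Psi + \cos\theta)/(2\cos\Psi')$ rather than through the auxiliary angle $\beta$. The function names are illustrative and not from the paper.

```python
import math

def specular_geometry(psi, theta, phi):
    """Return (psi_prime, alpha) in radians.

    psi   : angle of incidence measured from the surface normal
    theta : viewing angle measured from the surface normal
    phi   : azimuth between source and viewer (0 = forward direction)
    """
    cos_phase = (math.cos(theta) * math.cos(psi)
                 - math.sin(theta) * math.sin(psi) * math.cos(phi))
    cos_phase = max(-1.0, min(1.0, cos_phase))      # guard against rounding
    psi_prime = 0.5 * math.acos(cos_phase)          # half of the phase angle
    # Half-vector identity; assumes the phase angle is below 180 degrees.
    cos_alpha = (math.cos(psi) + math.cos(theta)) / (2.0 * math.cos(psi_prime))
    alpha = math.acos(max(-1.0, min(1.0, cos_alpha)))
    return psi_prime, alpha

def geometric_attenuation(psi, theta, phi):
    """Geometric attenuation factor min{1, masking term, shadowing term}."""
    psi_prime, alpha = specular_geometry(psi, theta, phi)
    ca, cp = math.cos(alpha), math.cos(psi_prime)
    return min(1.0,
               2.0 * ca * math.cos(theta) / cp,
               2.0 * ca * math.cos(psi) / cp)

if __name__ == "__main__":
    # Normal incidence viewed at 80 degrees: masking reduces the visible facet area.
    print(geometric_attenuation(0.0, math.radians(80.0), 0.0))
```

In the exact specular direction ($\Psi = \theta$, $\phi = 0$) this sketch gives $\alpha = 0$ and $\Psi' = \Psi$, so the facets that reflect toward the viewer are those aligned with the surface normal.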

4.  ERROR  ANALYSIS  FOR  THE  INTERSECTION  OF 
REFLECTANCE  CURVES 

Implementation of spectral/polarization stereo methods
involves the physical measurement of image irradiance values which
have associated with them an inherent error. This in turn generates
an error collar about each equi-reflectance curve. Given an image
irradiance measurement value $I$ and a reflectance function $R(p,q)$, the
image irradiance equation gives:

\[ I = R(p,q) \]

from which

\[ \delta I = \frac{\partial R}{\partial p}\,\delta p + \frac{\partial R}{\partial q}\,\delta q . \]

Therefore, for a given inherent error of measurement in $I$, the width
of the error collar about a point on the equi-reflectance curve
resulting from the above image irradiance equation will decrease as
the gradient of the reflectance function $R(p,q)$ increases at this point.

Figure 5(a) depicts a nearly perpendicular intersection of two
equi-reflectance curves along with their associated error collars. The
curved quadrilateral region which is cross hatched represents the area
in which the true surface orientation lies. Without referring to any
particular probability distribution function defined on this cross
hatched area, the expected error of measurement of surface
orientation derived from this intersection can be qualitatively
interpreted as the area of the cross hatched region while the worst
case error can be qualitatively interpreted as the maximum distance
between two points in the cross hatched region.

Figure 5(b) depicts a more oblique intersection of two
equi-reflectance curves along with their associated error collars.
Assuming that the gradients of the reflectance functions within the
cross hatched regions are roughly equal between figures 5 (a) and
(b), an oblique intersection increases both the expected and the worst
case errors for the measurement of surface orientation from the
intersection of equi-reflectance curves. Sometimes an oblique
intersection may occur in a region of gradient space where the
gradient of the reflectance function is high, making both the expected
and worst case errors less than the errors associated with a
perpendicular intersection of equi-reflectance curves in a region of
gradient space where the gradient of the reflectance function is quite
low.

Optimal criteria for the overall minimization of expected and
worst case errors are extremely complex and are dependent on a
multitude of assumptions including a priori knowledge about the
expected range for the surface orientation to be measured.
Simultaneous maximization of perpendicularity of intersections and
maximization of the gradient of the reflectance function over a
global set of gradient space points is impossible in most cases.

A good strategy to increase the gradient of the reflectance
function is to increase the phase angle. This involves a tradeoff as
increasing the phase angle increases the area of gradient space that
lies in shadow. A good strategy for optimizing perpendicular
intersections for spectral/polarization methods using the ETSRM is
to pick incident wavelengths and polarizations that will produce
intersections of specular and diffuse component equi-reflectance
curves. This is indicated in figure 6 which shows an overlay of the
specular and diffuse component reflection functions for MgO. Near
perpendicular intersections occur almost everywhere in the first
quadrant for a light source with incident orientation (7.0, 3.0, -1).
The same overlay of specular and diffuse reflectance maps would
result for Aluminum except that the reflected radiance values
corresponding to each equi-reflectance contour would be different
from those in figure 6.

Figure 6. Overlay of the specular and diffuse component reflection
functions; $P_s = 7.0$, $Q_s = 3.0$; Magnesium oxide, N = 1.77, K = 0.00.

It turns out that direct resolution of specular and diffuse
components for any dielectric is possible using polarization stereo
for selected phase angles close to twice the Brewster angle. This is
described in section 5. However, as reflection of light from
dielectrics is very insensitive to variations in wavelength, the
intersections of equi-reflectance curves for even vast shifts in
wavelength are so oblique that the equi-reflectance curves
themselves are almost indistinguishable. Even though reflection of
light from conductors has a stronger dependence on wavelength and
a moderate dependence on polarization (see figures 4 (c) and (d)),
this problem also occurs for spectral/polarization stereo methods
applied to conductors. These problems are severe enough that
minimization of error is overshadowed by the problem of ascertaining
an intersection in the first place. It is therefore necessary to use a
methodology, described in section 6, called indirect resolution of
specular and diffuse reflection components.
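The error-collar argument of this section can be illustrated numerically. To first order, a measurement error $\delta I$ displaces an equi-reflectance curve by about $\delta I / |\nabla R|$ along its normal in gradient space. The sketch below assumes a finite-difference gradient and uses the Lambertian term in gradient space as a stand-in reflectance function; the function names and the step size are illustrative choices, not from the paper.

```python
import math

def lambertian(p, q, ps, qs):
    """Lambertian reflectance (cosine of the incidence angle) in gradient space."""
    num = 1.0 + p * ps + q * qs
    den = math.sqrt(1.0 + p * p + q * q) * math.sqrt(1.0 + ps * ps + qs * qs)
    return max(0.0, num / den)

def reflectance_gradient(R, p, q, h=1e-5):
    """Central-difference estimate of (dR/dp, dR/dq) at (p, q)."""
    dR_dp = (R(p + h, q) - R(p - h, q)) / (2.0 * h)
    dR_dq = (R(p, q + h) - R(p, q - h)) / (2.0 * h)
    return dR_dp, dR_dq

def collar_half_width(R, p, q, delta_I):
    """First-order half width of the error collar for measurement error delta_I."""
    gp, gq = reflectance_gradient(R, p, q)
    return delta_I / math.hypot(gp, gq)

if __name__ == "__main__":
    # Source orientation (7.0, 3.0, -1) as used for the figures in the text.
    R = lambda p, q: lambertian(p, q, 7.0, 3.0)
    for point in ((0.5, 0.5), (0.958, 1.437), (3.0, 1.0)):
        print(point, collar_half_width(R, point[0], point[1], delta_I=0.01))
```

The collar narrows wherever the reflectance map is steep, which is the reason given above for preferring a large phase angle.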

5.  POLARIZATION  STEREO  FOR  DIELECTRICS 


It was mentioned in section 3 that for all dielectrics there exists
an angle of incidence, called the Brewster angle, at which parallel
polarized light is not reflected. That is, the Fresnel reflection
coefficient for parallel polarized light at the Brewster angle is zero.
Figure 4(b) shows that in fact the Fresnel reflection coefficient for
parallel polarized light is small over a large range of angles of
incidence less than the Brewster angle. By observing the expression
for $R_s$ in section 2, this phenomenon can be exploited to "scale" the
specular component of reflection to nearly zero, thus isolating the
diffuse component of reflection $R_d$.

Suppose that two image irradiance measurements are taken for
parallel polarized and perpendicular polarized incident light
respectively and that the phase angle is equal to exactly twice the
Brewster angle for a given dielectric. Therefore the specular angle of
incidence of light on the microfacets will be exactly the Brewster
angle itself. The first equi-reflectance curve resulting from the
incident parallel polarized light will be exactly the same as for the
Lambertian model at this angle of incidence. Since a moderate
amount of perpendicular polarized light is reflected at the Brewster
angle for a dielectric, the second equi-reflectance curve will have a
significant specular component of reflection. Even though the second
equi-reflectance curve is not a pure specular component of reflection,
it will tend to be a curve that will have a strong intersection with the
Lambertian reflectance curve. This technique of forcing the total
reflectance curves significantly towards the specular and diffuse
component reflection functions respectively is what is termed direct
resolution of specular and diffuse components.

An example of this technique is shown in figure 7 for MgO.
One disadvantage is that the Brewster angle for MgO is 60.5°, which
would make the phase angle equal to about 121°. This would leave
much of gradient space in shadow. Figures 11 (a) through (c) use an
incident source orientation vector of (7.0, 3.0, -1) which gives a
phase angle of approximately 82.6°. The specular angle of incidence
is therefore about 41.3°, nearly 20° from the Brewster angle for MgO.
Still, strong intersections are noted in the overlay of the parallel
polarized and the perpendicularly polarized reflectance maps. This is
due to the fact that the Fresnel reflection coefficient for parallel
polarized light is still negligible at this angle of incidence and that
the Fresnel reflection coefficient for perpendicularly polarized light
is approximately 4 times that for parallel polarized light (see figure
4(b); 41.3° is about 0.720 radians).

Figure 8 depicts an actual simulation of the determination of
local surface orientation from polarization stereo for a point on a
sphere with a Magnesium oxide surface. The sphere is assumed to
have radius 25 and the surface orientation is to be determined for the
point on the sphere corresponding to the image point (12,18). The
equation for the visible part of a sphere of radius $r$ using the
coordinate system depicted in figure 1.b is:

\[ z = -\sqrt{r^2 - x^2 - y^2} \]

which implies that

\[ \frac{\partial z}{\partial x} = -\frac{x}{z}, \qquad \frac{\partial z}{\partial y} = -\frac{y}{z} . \]

Therefore, the local surface orientation at the point corresponding to
the image coordinate (12,18), assuming orthographic projection, is
approximately (0.958, 1.437, -1) in gradient space representation.
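The numbers quoted in this section can be checked with a few lines of Python. The sketch below assumes a sphere centered at the origin whose visible surface is $z = -\sqrt{r^2 - x^2 - y^2}$, so that the gradient is $(x, y)/\sqrt{r^2 - x^2 - y^2}$, and a viewing direction of $(0, 0, -1)$. The function names are illustrative and not from the paper. A radius of 25 reproduces the stated gradient (0.958, 1.437, -1) at image point (12, 18).

```python
import math

def sphere_gradient(x, y, r):
    """Gradient (p, q) of the visible sphere surface z = -sqrt(r^2 - x^2 - y^2)."""
    s = math.sqrt(r * r - x * x - y * y)
    return x / s, y / s

def phase_angle_deg(ps, qs):
    """Angle in degrees between source direction (ps, qs, -1) and viewer (0, 0, -1)."""
    return math.degrees(math.acos(1.0 / math.sqrt(1.0 + ps * ps + qs * qs)))

if __name__ == "__main__":
    p, q = sphere_gradient(12.0, 18.0, 25.0)
    print("surface gradient:", round(p, 3), round(q, 3))
    phase = phase_angle_deg(7.0, 3.0)
    print("phase angle (deg):", round(phase, 1))
    print("specular angle of incidence (deg):", round(phase / 2.0, 1))
```

The computed phase angle agrees with the approximate value of 82.6° quoted in the text to within about a tenth of a degree.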


Figure 7. Overlay of reflectance maps for perpendicular and parallel
polarization. Reflectance function $0.5\,(2.0\,R_s + R_d)$; source vector
orientation $P_s = 7.0$, $Q_s = 3.0$; Magnesium oxide, N = 1.77, K = 0.00
(365 nm); rms slope value 2.0.




The  equi-reflectance  curves  in  figure  8  result  from  three 
different  incident  polarizations  respectively.  It  is  clear  that  an 
intersection occurs at the correct orientation point. This intersection
could  be  obtained  by  using  the  two  equi-reflectance  curves  for 
parallel  polarization  and  perpendicular  polarization  respectively. 
However  there  is  an  additional  intersection  point  that  could  just  as 
well  be  the  true  value  of  the  surface  orientation.  The  third  equi- 
reflectance  curve  produced  from  unpolarized  incident  light  does  not 
provide  more  information  with  respect  to  resolving  between  these 
two  intersection  points.  In  fact,  for  this  same  configuration  of  the 
imaging  system,  all  equi-reflectance  curves  produced  from  any 
incident  polarization  or  wavelength  will  go  through  these  same  two 
points  in  gradient  space.  This  is  due  to  an  inherent  "flip"  symmetry 
about  the  source-viewing  axis  that  cannot  be  broken  by  varying  the 
value of the Fresnel reflection coefficient, $F(\Psi',\eta)$.

It  is  possible  to  break  this  "flip"  symmetry  by  using  a  second 
light  source  which  has  an  incident  orientation  not  lying  on  the 
source-viewing  axis  of  the  original  light  source.  This  would  produce 
an  equi-reflectance  curve  going  through  the  true  surface  orientation 
point  and  not  going  through  the  second  intersection  point  observed 
for  polarization  stereo.  However,  the  "flip"  symmetry  inherent  to 
spectral/polarization  stereo  can  also  be  broken  by  using  a  single 
extended light source with a variable aperture. This will be
discussed  in  section  7. 
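The two point ambiguity can be illustrated by reflecting a gradient space point across the source-viewing axis. The sketch below assumes a point light source at $(P_s, Q_s, -1)$ and a viewer at $(0, 0, -1)$, and checks that the incidence and viewing angles, on which the reflectance terms depend for a fixed phase angle, are unchanged by the reflection. The function names are illustrative and not from the paper.

```python
import math

def flip_about_source_axis(p, q, ps, qs):
    """Reflect (p, q) across the line through the origin and (ps, qs)."""
    norm2 = ps * ps + qs * qs
    dot = p * ps + q * qs
    return 2.0 * dot * ps / norm2 - p, 2.0 * dot * qs / norm2 - q

def incidence_and_view_cosines(p, q, ps, qs):
    """Cosines of the incidence and viewing angles for surface gradient (p, q)."""
    n = math.sqrt(1.0 + p * p + q * q)
    s = math.sqrt(1.0 + ps * ps + qs * qs)
    return (1.0 + p * ps + q * qs) / (n * s), 1.0 / n

if __name__ == "__main__":
    p, q = 0.958, 1.437                 # orientation measured in section 5
    pf, qf = flip_about_source_axis(p, q, 7.0, 3.0)
    print("flipped point:", round(pf, 3), round(qf, 3))
    print(incidence_and_view_cosines(p, q, 7.0, 3.0))
    print(incidence_and_view_cosines(pf, qf, 7.0, 3.0))
```

Because both cosines agree at the two points, varying only wavelength or polarization cannot separate them; points on the axis itself are their own mirror image and remain unambiguous.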

6.  SPECTRAL  AND  POLARIZATION  STEREO  BY 
INDIRECT  RESOLUTION  INTO  SPECULAR  AND  DIFFUSE 
REFLECTION  COMPONENTS 


The equi-reflectance curves in figures 9 and 10 attempt to
measure the same surface orientation in the problem that is described
in section 5. For figure 10, the surface of the sphere is assumed to be
made of Aluminum. By observing figures 9 and 10 it is evident that
directly obtaining an intersection point for reflectance curves for
spectral stereo applied to dielectrics, or for any combination of
spectral and polarization stereo applied to conductors, is very
difficult. Spectral stereo applied to dielectrics is hard to implement
because the simple index of refraction is very insensitive to changes
in wavelength. Unless the sampling rate for the equi-reflectance
curves in gradient space is very high, two curves obtained for
different incident wavelengths will appear indistinguishable. Another
problem for spectral stereo applied to dielectrics is the necessity for
low measurement error in the image irradiance. The shift in the
observed reflected radiance between the two equi-reflectance curves
in figure 9 is approximately 3.0%.

Figure 8. Intersection of equi-reflectance curves for different
polarizations. Combined specular and diffuse reflectance map;
reflectance function $0.5\,(2.0\,R_s + R_d)$; source vector orientation
$P_s = 7.0$, $Q_s = 3.0$; Magnesium oxide, N = 1.77, K = 0.00 (365 nm);
rms slope value 3.0.

Figure 9. Combined specular and diffuse reflectance map; reflectance
function $0.5\,(2.0\,R_s + R_d)$; source vector orientation $P_s = 7.0$,
$Q_s = 3.0$; Magnesium oxide, N = 1.77, K = 0.00 (365 nm) and N = 1.73,
K = 0.0 (1130 nm); rms slope value 3.0.

Figure 10. Intersection of equi-reflectance curves for simultaneous
variations in wavelength and polarization. Combined specular and
diffuse reflectance map; reflectance function $0.5\,(0.875\,R_s + R_d)$;
source vector orientation $P_s = 7.0$, $Q_s = 3.0$; Aluminum, with
perpendicular polarization at one wavelength and parallel polarization
at a second (700 nm, N = 1.55, K = 7.00).


The observed image irradiance shift for the two
equi-reflectance curves in figure 10 is fairly significant, being
almost 9%. The problem of indistinguishability between
equi-reflectance curves here is even worse than in figure 9. This is
due to the relative invariance of the shape of the equi-reflectance
curves for aluminum for fairly significant changes in the reflected
radiance.


A technique is presented here that will use two reflected
radiance measurements to solve for the values of the specular and
diffuse components of reflection, respectively, from two simultaneous
equations. The functions $R_s$ and $R_d$ are the same as those defined
in section 2 except that the $R_s$ used here is $gR_s$. Define $r_s$
such that $R_s = F(\Psi', \eta)\, r_s$. Suppose that two reflectance
measurements $I_{ref1}$ and $I_{ref2}$ are taken from two successive
images at the same pixel. Also, the values of the Fresnel term with
respect to this pixel for the first and second image are
$F^1(\Psi', \eta)$ and $F^2(\Psi', \eta)$ respectively. Then the
following equations arise:


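$$F^1(\Psi',\eta)\, r_s + R_d = I_{ref1} \pm \epsilon$$

$$F^2(\Psi',\eta)\, r_s + R_d = I_{ref2} \pm \epsilon$$

where $\epsilon$ is the error inherent in the reflectance measurement.
The solutions for $R_s$ and $R_d$ are:

$$R_s = F^1(\Psi',\eta)\, \frac{I_{ref1} - I_{ref2}}{F^1(\Psi',\eta) - F^2(\Psi',\eta)} \pm \frac{2 F^1(\Psi',\eta)}{F^1(\Psi',\eta) - F^2(\Psi',\eta)}\, \epsilon$$

$$R_d = \frac{F^1(\Psi',\eta)\, I_{ref2} - F^2(\Psi',\eta)\, I_{ref1}}{F^1(\Psi',\eta) - F^2(\Psi',\eta)} \pm \frac{F^1(\Psi',\eta) + F^2(\Psi',\eta)}{F^1(\Psi',\eta) - F^2(\Psi',\eta)}\, \epsilon$$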


Note that the solution for $R_s$ is for the specular component of
the  total  reflectance  measured  in  the  first  image.  This  technique  of 
decomposing  the  value  of  a  total  reflectance  function  into  its 
specular  and  diffuse  component  values  is  what  is  termed  indirect 
resolution  of  specular  and  diffuse  components. 


Thus, even for small shifts in the Fresnel reflection coefficient,
two measurements of the reflected radiance make it possible to
obtain the specular and diffuse component reflection values for each
measurement. These values in turn can be used to generate an
equi-reflectance curve for the specular and diffuse component
respectively, which produces strongly detectable intersections as
mentioned in section 4. While the problem of detecting the
intersection may be solved, the associated error collars may be huge
if the experimental measurement error $\epsilon$ is significant and the
difference between the first and second Fresnel reflection
coefficients is very small. This can be seen from the equations
immediately above.
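The indirect resolution and its sensitivity to measurement error can be illustrated with a minimal Python sketch. The function name `resolve_components` and all numerical values are hypothetical and chosen only for illustration; the error terms assume the same worst-case bound on both reflectance measurements.

```python
def resolve_components(i_ref1, i_ref2, f1, f2, eps=0.0):
    """Resolve two reflectance measurements into specular and diffuse parts.

    Solves  f1 * r_s + R_d = i_ref1  and  f2 * r_s + R_d = i_ref2,
    where f1 and f2 are the Fresnel terms of the first and second image.
    Returns (R_s, R_d, err_s, err_d): R_s is the specular component of the
    first image, and err_s, err_d are worst-case error bounds when each
    measurement is only known to within +/- eps.
    """
    denom = f1 - f2
    if denom == 0.0:
        raise ValueError("the two Fresnel terms must differ")
    r_s = (i_ref1 - i_ref2) / denom
    spec = f1 * r_s
    diff = (f1 * i_ref2 - f2 * i_ref1) / denom
    err_s = 2.0 * f1 * eps / abs(denom)
    err_d = (f1 + f2) * eps / abs(denom)
    return spec, diff, err_s, err_d


# Hypothetical pixel: r_s = 0.4, R_d = 0.3, Fresnel terms 0.10 and 0.07.
spec, diff, err_s, err_d = resolve_components(0.34, 0.328, 0.10, 0.07, eps=0.01)
```

With these illustrative numbers a 0.01 error in the measurements grows to an error of roughly 0.067 in the specular component, which is larger than the component itself (0.04). This is the error amplification that occurs when the two Fresnel terms are close.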


Figures 11 and 12 show the intersection of specular and
diffuse component equi-reflectance curves obtained from the same
reflectance measurements that generated the curves in figures 9 and
10, respectively. Of course the experimental error $\epsilon$ for this
simulation is assumed to be 0. The fact that the curves in figures 11
and 12 are exactly the same should be no surprise, as equi-reflectance
curves are invariant with respect to change in scale. The Fresnel
reflection coefficient merely scales the specular component of
reflection. Only the observed specular reflected radiance is different
between the two figures.


[Figure 11: Resolved specular and diffuse components. Magnesium
oxide, unpolarized light; the two reflectance measurements are
obtained at different wavelengths.]

[Figure 12: Resolved specular and diffuse components. Aluminum;
first measurement with perpendicular polarization, second
measurement with parallel polarization.]


The same "flip" symmetry that was discussed in section 5 is
also inherent to the individual functions $R_s$ and $R_d$. Additional
reflectance measurements from different incident wavelengths and
polarizations would not resolve the two point intersection ambiguity.
To achieve unique measurement of surface orientation, the technique
presented in section 7 using an extended light source will have to be
implemented.


Spectral stereo for dielectrics for determining surface
orientation is not generally recommended. Errors in the measurement
of surface orientation for conductors using spectral/polarization
methods can be reduced by simultaneously varying the wavelength
and polarization as is done in figures 10 and 12. The shift in the
Fresnel reflection coefficients from the first to the second image can
be maximized by using parallel incident polarization at a high
wavelength and then using perpendicular polarization at a low
wavelength.

7. SYMMETRY BREAKING USING A SINGLE EXTENDED
LIGHT SOURCE


The "flip" symmetry about the source-viewing axis that exists
for the equi-reflectance curves produced by the ETSRM can most
easily be shown by equating the gradient space representation (p, q,
-1) with surface orientation represented in the spherical coordinate
system depicted in figure 2 (b) with the z-axis being the viewer
orientation. That is,

$$\frac{1}{\sqrt{1+p^2+q^2}}\,(p,\ q,\ -1) = (\sin\theta\sin\Phi,\ \sin\theta\cos\Phi,\ -\cos\theta).$$

This implies that:

$$p = r\sin\Phi, \qquad q = r\cos\Phi \qquad (2)$$

where:

$$r = \sqrt{p^2+q^2} = \tan\theta.$$

This demonstrates that the points (p,q) in gradient space can be
identified with surface orientation using spherical coordinates via the
gnomonic projection of the unit Gaussian sphere onto the tangent
plane at the north pole. Points on the unit Gaussian sphere
representing surface orientation are projected onto this plane by rays
which emanate from the center of the sphere and go through the
respective point.

The condition that the probability distribution of microfacets,
$P(\alpha)$, be isotropic implies that there is a mirror symmetry with
respect to the invariance of reflected radiance for surface orientation
normals that differ azimuthally in $\Phi$ by the same angle about the
source-viewing plane. The orientation component $\theta$ is assumed to
be the same for both normals. The reflected radiance is the same for
both surface orientations because the parameters $\Psi$, $\Psi'$,
$\theta$ and $\alpha$ are automatically the same. It is clear that
expression (1) in section 2 will yield the same value under this
equivalence. This results by observing that the specular and diffuse
component reflected radiance are each left invariant.


The source-viewing axis in gradient space is the line at the
constant angle $\tan^{-1}(P_s/Q_s)$ from the q-axis. The invariance of
reflected radiance about the source-viewing plane described above
implies that separate points that are equi-distant from the origin of
gradient space and that make the same angle with the source-viewing
axis must lie on the same equi-reflectance curve. This can be
expressed succinctly as the invariance of all equi-reflectance curves
under the "flip" symmetry transformation:

$$\Phi \rightarrow 2\tan^{-1}(P_s/Q_s) - \Phi$$

where the angle $\Phi$ is measured clockwise from the q-axis. Doing
some algebra using equations (2) above, this transformation can be
expressed in terms of gradient space variables as:

$$p \rightarrow \frac{P_s^2 - Q_s^2}{P_s^2 + Q_s^2}\, p + 2\, \frac{P_s Q_s}{P_s^2 + Q_s^2}\, q$$

$$q \rightarrow 2\, \frac{P_s Q_s}{P_s^2 + Q_s^2}\, p - \frac{P_s^2 - Q_s^2}{P_s^2 + Q_s^2}\, q \qquad (3)$$


Since the shift in the value of the Fresnel reflection coefficient
does not alter the "flip" symmetry of the specular reflectance
function during the process of spectral and/or polarization stereo
using a single point light source, it is clear that a two point ambiguity
will always exist for the measurement of surface orientations not on
the source-viewing axis. From equations (3) it can be deduced that
the same "flip" symmetry does not hold separately for a second light
source with incident orientation $(P_s', Q_s', -1)$ if and only if
$(P_s, Q_s)$ and $(P_s', Q_s')$ are linearly independent. That is, a
second light source will have a different "flip" symmetry if and only
if its incident orientation does not lie on the source-viewing axis of
the first light source.


Suppose now that a second light source not lying on the
source-viewing axis of the first light source was incident
simultaneously with the first. That is, the reflected radiance is now
measured while both light sources are on. The "flip" symmetry of
the first light source would effectively be broken for the resulting
equi-reflectance curves. If at least a two point ambiguity existed in
measuring surface orientation for all incident wavelengths and
polarizations, the symmetry producing the ambiguity would not be
the original "flip" symmetry. Such symmetries may occur, but are
highly unlikely for the measurement of most surface orientations.
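The "flip" transformation is a reflection of gradient space about the source-viewing axis, which the following minimal Python sketch illustrates. The function name `flip` and the source orientations are hypothetical values chosen for illustration. The sketch shows that a second source lying on the source-viewing axis of the first is left fixed, while one off the axis is moved to a different point at the same distance from the origin.

```python
import math


def flip(p, q, ps, qs):
    """Reflect the gradient space point (p, q) about the source-viewing axis.

    The axis is the line through the origin and the source point (ps, qs).
    """
    d = ps * ps + qs * qs
    if d == 0.0:
        raise ValueError("source along the viewing direction: axis undefined")
    c = (ps * ps - qs * qs) / d
    s = 2.0 * ps * qs / d
    return (c * p + s * q, s * p - c * q)


# Hypothetical first source at (P_s, Q_s) = (1, 3).
on_axis = flip(2.0, 6.0, 1.0, 3.0)    # second source on the axis: unchanged
off_axis = flip(2.0, 0.5, 1.0, 3.0)   # second source off the axis: moved
```

Applying `flip` twice returns the original point, so the two ambiguous surface orientations always occur as a pair.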


Suppose that n point light sources are now simultaneously
incident on a material surface. It can be shown using a simple
geometric argument that the resulting equi-reflectance curves possess
the "flip" symmetry of one of the n light sources being used if and
only if the configuration of the incident orientation points in gradient
space of the n light sources is in turn "flip" symmetric about the
source-viewing axis of one of the n light sources. A single extended
light source can be viewed as a continuous distribution of a large
number of point light sources. Therefore the equi-reflectance curves
produced from a single extended source will possess a "flip"
symmetry if and only if the area in gradient space subtended by the
continuous distribution of incident orientation points is in turn "flip"
symmetric.
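The condition on a configuration of n point sources can be checked numerically. The sketch below is only an illustration under stated assumptions: all sources are taken to have equal strength, no source lies along the viewing direction, and the names `flip_point` and `has_flip_symmetry` as well as the example configurations are hypothetical.

```python
import math


def flip_point(point, axis_source):
    """Reflect a gradient space point about the axis through the origin
    and the source point axis_source."""
    p, q = point
    ps, qs = axis_source
    d = ps * ps + qs * qs
    c = (ps * ps - qs * qs) / d
    s = 2.0 * ps * qs / d
    return (c * p + s * q, s * p - c * q)


def has_flip_symmetry(sources, tol=1e-9):
    """True if the set of source points is mapped onto itself by the flip
    about the source-viewing axis of at least one of the sources."""
    for axis in sources:
        if math.hypot(axis[0], axis[1]) <= tol:
            continue  # axis undefined for a source along the viewing direction
        mirrored = [flip_point(src, axis) for src in sources]
        if all(
            any(math.hypot(m[0] - s[0], m[1] - s[1]) <= tol for s in sources)
            for m in mirrored
        ):
            return True
    return False


symmetric = has_flip_symmetry([(1.0, 3.0), (2.0, 0.5), (-1.3, 1.6)])
asymmetric = has_flip_symmetry([(1.0, 3.0), (2.0, 0.5)])
```

The first configuration contains a mirror pair about the axis of the source at (1, 3) and therefore retains that "flip" symmetry; the second does not. An extended source could be tested in the same way by sampling its area in gradient space.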

Figure 13 shows the unique measurement of the surface
orientation for the problem in section 5 for a MgO surface. Three
reflected radiance measurements were taken altogether. Two of them
used unpolarized and perpendicular polarized incident light from a
single point source. The third reflected radiance measurement was
taken for perpendicular incident light from an asymmetric
"half-moon" extended source depicted in figure 14. The half angle
used is 0.4 radians for the extended source, and the expression for the
reflected radiance in section 2 was numerically integrated over the
solid angle subtended by the source. The size of the extended source
was selected to be large to emphasize the process of breaking the
inherent "flip" symmetry. Unique surface orientation measurements
from much smaller extended sources are possible.

Of course all three reflected radiance measurements can be
taken from the same light source with a variable aperture diaphragm
in front. For the first two measurements the aperture is a small
"pinhole". For the third measurement the aperture is changed to the
desired asymmetric shape and size. As mentioned above, it is
possible for other more complicated symmetries to exist for resulting
reflectance curves that may hinder the process of measuring certain
surface orientations from spectral/polarization methods. Further
changes to the shape of the extended source may then be useful, but
at least the primary "flip" symmetry is broken.


[Figure 13: Combined specular and diffuse reflectance map,
magnesium oxide (n = 1.77, k = 0.00).]


8.  CONCLUSION 

It has been demonstrated through computer simulation of
equi-reflectance curves that arise from the extended Torrance-Sparrow
reflection model that the measurement of surface orientation is
obtainable by using stereo techniques that vary the wavelength
and/or the polarization of incident light from a single light source
while leaving the imaging geometry invariant. Different techniques
were explored using two very different types of materials,
magnesium oxide and aluminum. This was to show the applicability
of spectral/polarization stereo methods to a large class of material
surfaces.


The  most  viable  technique  presented  was  that  of  polarization 
stereo  for  dielectrics.  This  was  based  on  the  observation  that  the 
specular  component  equi-reflectance  curves  exhibit  strong 
intersections  with  diffuse  component  equi-reflectance  curves  over  a 
large  region  of  gradient  space.  Because  the  Fresnel  reflection 
coefficient  for  parallel  polarized  light  is  small  in  a  large  region  of 
angles  of  incidence  below  the  Brewster  angle,  it  is  possible  to  force 
the  total  reflectance  function  to  be  almost  identical  to  its  pure  diffuse 
(i.e.  Lambertian)  reflection  component  for  a  phase  angle  well  below 
90°.  The  Fresnel  reflection  coefficient  for  perpendicular  polarized 
light  in  this  region  is  significant  forcing  the  total  reflectance  function 
close  to  its  specular  reflection  component.  Thus  using  only  two 
measurements  of  reflected  radiance  from  parallel  polarized  and 
perpendicular  polarized  incident  light  respectively,  surface 
orientation  can  be  measured  with  relatively  small  error. 
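The behavior of the two polarization components can be made concrete with the standard Fresnel equations for a non-absorbing dielectric. The sketch below is an illustration only: it assumes light incident from air, uses the index n = 1.77 quoted for magnesium oxide in the figures, and the function name `fresnel_dielectric` is hypothetical.

```python
import math


def fresnel_dielectric(theta_i, n):
    """Fresnel reflection coefficients for light entering a dielectric.

    theta_i is the angle of incidence in radians and n the simple index of
    refraction (no absorption). Returns (R_parallel, R_perpendicular), where
    parallel and perpendicular refer to the plane of incidence.
    """
    cos_i = math.cos(theta_i)
    sin_t = math.sin(theta_i) / n          # Snell's law
    cos_t = math.sqrt(1.0 - sin_t * sin_t)
    r_par = (n * cos_i - cos_t) / (n * cos_i + cos_t)
    r_perp = (cos_i - n * cos_t) / (cos_i + n * cos_t)
    return r_par * r_par, r_perp * r_perp


n_mgo = 1.77
brewster = math.atan(n_mgo)                # about 60.5 degrees
r_par_b, r_perp_b = fresnel_dielectric(brewster, n_mgo)
r_par_0, r_perp_0 = fresnel_dielectric(0.0, n_mgo)
```

At the Brewster angle the parallel coefficient vanishes while the perpendicular coefficient is roughly 0.27, which is why parallel polarized light leaves a nearly pure diffuse component and perpendicular polarized light retains a substantial specular component.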

Spectral stereo for dielectrics is considered to be the least
practical due to the insensitivity of the simple index of refraction to
changes in wavelength. Spectral/polarization stereo for conductors is
best when wavelength and polarization are simultaneously varied so
as to maximize the difference in the Fresnel reflection coefficients
between successive images. The main issue for spectral stereo
applied to dielectrics and spectral/polarization methods applied to
conductors is resolution of intersection points. In these cases the
shapes of equi-reflectance curves are relatively invariant to changes
in incident wavelength and polarization even though the change in
the reflected radiance for conductors can be fairly significant. A
technique was presented to resolve the specular and diffuse
component values of the reflected radiance by solving two image
irradiance equations simultaneously. Surface orientation is measured
from the strongly discernible intersection of specular and diffuse
component equi-reflectance curves. However, for small variations in
the Fresnel reflection coefficient, the associated errors can be quite
high.

Finally it was formally shown how an isotropic reflectance
function induces a "flip" symmetry on resulting equi-reflectance
curves in gradient space about the source-viewing axis. Using a point
light source and spectral/polarization techniques, it is therefore
theoretically impossible to measure surface orientations that do not
lie on the source-viewing axis to better than a two point ambiguity.
A technique using a single extended light source was presented to
resolve this ambiguity by breaking the inherent "flip" symmetry. By
using an aperture which itself is not "flip" symmetric,
equi-reflectance curves that pass through the true surface orientation
point can be produced that do not pass through the false second point
of ambiguity.

Experimentation  is  currently  underway  to  empirically  verify 
the  theoretical  development  presented  in  this  paper. 


This research was supported in part by ARPA grant #N00039-84-C-0165.


ACKNOWLEDGEMENTS

Many thanks go to the people that reviewed earlier drafts of
this work. In particular, Terry Boult provided a very thoughtful
review and helpful suggestions.


REFERENCES


[Beckmann and Spizzichino 1963]
Beckmann, P., and Spizzichino, A.
The Scattering of Electromagnetic Waves from Rough Surfaces
Macmillan, 1963


[Cook and Torrance 1981]
Cook, R.L., and Torrance, K.E.
A Reflectance Model For Computer Graphics
SIGGRAPH 1981 Proceedings, vol. 15 #3 pp. 307-316, 1981


[Physics Handbook]
American Institute of Physics Handbook
Third Edition
McGraw-Hill, 1972


[Siegel and Howell 1981]
Siegel, R., and Howell, J.R.
Thermal Radiation Heat Transfer
McGraw-Hill, 1981


[Torrance and Sparrow 1967]
Torrance, K.E., and Sparrow, E.M.
Theory for Off-Specular Reflection From Roughened Surfaces
Journal of The Optical Society of America, vol. 57 #9
pp. 1105-1114, September 1967


[Wolff 1986]
Wolff, L.B.
Physical Stereo For Combined Specular and Diffuse Reflection
Tech report CS-242-86, Columbia University, Nov. 1, 1986


[Woodham 1980]
Woodham, R.J.
Photometric Method for Determining Surface Orientation from
Multiple Images
Optical Engineering, vol. 19 #1 pp. 139-144, Jan-Feb 1980


SURFACE  CURVATURE  AND  CONTOUR  FROM  PHOTOMETRIC  STEREO 

Lawrence  Brill  Wolff 


Department  of  Computer  Science 
Columbia  University 
New  York,  NY  10027 


ABSTRACT 


Photometric stereo has been used to derive the normal map for
a smooth surface depicting local surface orientation at each point.
Many robotic vision applications require higher level descriptions of
surface shape such as Gaussian curvature and contour lines.
Presented here is an enhancement of conventional photometric stereo
that directly computes the Gaussian curvature as well as principal
curvatures and their directions at each point on an arbitrary smooth
surface. First, image irradiance values from successive images are
used to determine local surface orientation. Then image gradient
values from the same set of successive images are used to compute
the image Hessian matrix which gives complete knowledge about the
second order rate of change of the smooth surface with respect to the
image plane. It is shown that this provides enough information to
compute the curvature matrix for a visible point on the smooth
surface. The Gaussian curvature for each point is generated by
taking the determinant of the curvature matrix. Principal directions
of curvature are the eigenvectors of the curvature matrix. Lines of
curvature can be generated by piecing together tangent lines that run
parallel to a principal direction of curvature at each point.

1.  INTRODUCTION 


The photometric stereo technique presented here makes use of
the image Hessian matrix presented in [Woodham 1981]. The image
Hessian has also been referred to as the curvature matrix of the
smooth surface being imaged. There is however a significant
difference between the Hessian matrix for a smooth surface
parameterized by its height above the image plane and the curvature
matrix derived from the standard definition using the differential of
the Gauss map. It will be shown in section 2 that while the
determinant of the Hessian matrix agrees in sign with the Gaussian
curvature, these two values are different quantitatively.


Computing the sign of the determinant of the Hessian matrix
obtained from photometric stereo provides a quick method for
determining whether the smooth surface is convex, concave or flat at
a point. The curvature matrix at a point can be quantitatively derived
from the components of the Hessian matrix and knowledge of the
local surface orientation. Therefore Gaussian curvature and lines of
curvature can be directly computed from image values without
referring to neighboring values for normals from the normal map.


It was shown in [Woodham 1981] that the equations used to
solve for the image Hessian matrix from a single image are
underconstrained, solving for three variables with two equations.
(Auxiliary constraints can aid in the process of deriving the image
Hessian from a single image. For instance for developable surfaces
(which results in at least one principal curvature being zero) all
components of the image Hessian can be solved completely.) The
photometric stereo method presented here eliminates the need for
any additional assumptions about the smooth surface with respect to
fully recovering the image Hessian.

It is demonstrated in section 3 that once local surface
orientation is known, the image Hessian can be completely solved
for an arbitrary smooth surface from two or more successive images.
Therefore the derivation of the image Hessian can be combined with
conventional photometric stereo by first computing local surface
orientation from successive images and then using at least two of
these successive images to solve for the components of the image
Hessian directly. Even though using more successive images would
severely overconstrain the solution for the Hessian matrix, the
solution would be more statistically accurate using experimental
values.

2. THE IMAGE HESSIAN AND THE CURVATURE MATRIX


The image irradiance equation relating image irradiance I(x,y)
to reflected scene radiance is written as:

$$I(x,y) = R(p,q)$$

where R(p,q) is the reflectance map as a function of gradient space
variables p and q. If the height of the smooth object surface above
the image plane is given by the twice differentiable function h(x,y)
then the smooth object surface can be parameterized using image
coordinates by (x, y, h(x,y)) and

$$p = \frac{\partial h}{\partial x}, \qquad q = \frac{\partial h}{\partial y}.$$

The coordinate system used here is depicted in figure 1.
Orthographic projection of the object surface onto the image plane is
assumed throughout. Taking the partial derivative of the image
irradiance equation with respect to the variables x and y gives the
two equations

$$I_x = \frac{\partial p}{\partial x}\, R_p + \frac{\partial q}{\partial x}\, R_q$$

$$I_y = \frac{\partial p}{\partial y}\, R_p + \frac{\partial q}{\partial y}\, R_q$$

which in turn gives rise to the vector equation


$$\begin{pmatrix} I_x \\ I_y \end{pmatrix} = \begin{pmatrix} \partial p/\partial x & \partial q/\partial x \\ \partial p/\partial y & \partial q/\partial y \end{pmatrix} \begin{pmatrix} R_p \\ R_q \end{pmatrix}. \qquad (1)$$

The image Hessian is defined as

$$\begin{pmatrix} \partial p/\partial x & \partial q/\partial x \\ \partial p/\partial y & \partial q/\partial y \end{pmatrix} = \begin{pmatrix} \partial^2 h/\partial x^2 & \partial^2 h/\partial x \partial y \\ \partial^2 h/\partial x \partial y & \partial^2 h/\partial y^2 \end{pmatrix}.$$

Since the image Hessian is a symmetric two-by-two matrix, it is
determined by three independent parameters $h_{xx}$, $h_{xy}$ and
$h_{yy}$.

Define the Gauss map to be a function N from a smooth
surface S to the unit 2-sphere $S^2$ mapping each point on S to its
respective tangent plane unit normal. That is

$$N: S \rightarrow S^2, \qquad S^2 = \{ (x,y,z) \in R^3 : \|(x,y,z)\| = 1 \}.$$

The Gauss map for a smooth surface S is equivalent to the normal
map for S using unit normals. Given a differentiable mapping F
between two smooth surfaces $S_1$ and $S_2$ the differential, dF, of F
is a mapping of tangent vectors from the tangent plane at each point p
on $S_1$ to the tangent vectors of the tangent plane at F(p) on $S_2$.
The mapping dF is induced by mapping the tangent vectors of
parameterized curves going through p on $S_1$ to the tangent vectors
of the image of these parameterized curves under F going through F(p)
on $S_2$. For a formal definition of what it means for a function
between two smooth surfaces to be differentiable and a formal
definition of the differential of a function see [Do Carmo 1976].

The standard definition of the curvature matrix for the smooth
surface S is as the differential dN of the Gauss map N. Therefore the
curvature matrix at a point p on S is the two-by-two matrix which
when multiplying a vector v in the tangent plane $T_p(S)$ at p gives
the rate of change of the surface normal at p along v. For a given
parameterization $\Phi(x,y)$ a derivation of the curvature matrix is
given using the basis vectors $\Phi_x$ and $\Phi_y$ spanning the
tangent plane at p. In this case the curvature matrix for S can be
represented by $(a_{ij})$ where

$$a_{11} = \frac{fF - eG}{EG - F^2}, \quad a_{12} = \frac{gF - fG}{EG - F^2}, \quad a_{21} = \frac{eF - fE}{EG - F^2}, \quad a_{22} = \frac{fF - gE}{EG - F^2}. \qquad (2)$$

The terms E, F and G are the coefficients for the first fundamental
form and the terms e, f and g are the coefficients for the second
fundamental form. Equations (2) are called the Weingarten
equations. It turns out that the curvature matrix is a self-adjoint
linear mapping and that $(a_{ij})$ can be made symmetric by performing
a similarity transformation to an orthogonal basis.


For the image plane parameterization $\Phi(x,y) = (x, y, h(x,y))$ the
basis vectors spanning the tangent plane at a point are

$$\Phi_x = (1, 0, h_x), \qquad \Phi_y = (0, 1, h_y) \qquad (3)$$

and the first and second fundamental forms are given by

$$E = 1 + h_x^2, \qquad F = h_x h_y, \qquad G = 1 + h_y^2 \qquad (4a)$$

$$e = \frac{h_{xx}}{\sqrt{1 + h_x^2 + h_y^2}}, \qquad f = \frac{h_{xy}}{\sqrt{1 + h_x^2 + h_y^2}}, \qquad g = \frac{h_{yy}}{\sqrt{1 + h_x^2 + h_y^2}} \qquad (4b)$$

The Gaussian curvature K is defined to be the determinant of
the curvature matrix. Since the curvature matrix is a self-adjoint
linear mapping it has two orthogonal eigenvectors which are termed
the principal directions of curvature. The respective eigenvalues
$k_1$ and $k_2$ are termed the principal curvatures with
$K = k_1 k_2$. The mean curvature is given by
$H = \frac{1}{2}(k_1 + k_2)$. The derivation of the Gaussian
curvature and the mean curvature using the image plane
parameterization above can be found in [Do Carmo 1976]. These are
given by

$$K = \frac{h_{xx} h_{yy} - h_{xy}^2}{(1 + h_x^2 + h_y^2)^2},$$

$$H = \frac{(1 + h_x^2)\, h_{yy} - 2 h_x h_y h_{xy} + (1 + h_y^2)\, h_{xx}}{2\,(1 + h_x^2 + h_y^2)^{3/2}}.$$

From the expression for K above it is clear that the
determinant of the image Hessian has the same sign as the Gaussian
curvature and that both are either zero or non-zero simultaneously.
Therefore the determinant of the image Hessian gives a good
qualitative measure of the Gaussian curvature. While the Hessian
matrix is not truly the standard curvature matrix, its components can
be substituted into equations (2) and (4) along with $h_x$ and $h_y$
obtained from the determination of local surface orientation. The
expressions for K and H are uniquely determined from this
information. In section 4 it will be seen that lines of curvature can
also be generated from this given information. Note from equations
(4b) that the Hessian matrix coincides with the second fundamental
form at critical points, that is points where $h_x = h_y = 0$. Also, the
determinant of the Hessian matrix is equal to the Gaussian curvature
at critical points.
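The passage from the image Hessian and the local surface orientation to the curvature matrix can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the function names are hypothetical, and the test surface (the upper half of a sphere of radius 2, given by its height function) is chosen only because its curvatures are known in closed form.

```python
import math


def curvature_matrix(hx, hy, hxx, hxy, hyy):
    """Curvature matrix (a_ij) from the Weingarten equations.

    hx, hy come from the local surface orientation; hxx, hxy, hyy are the
    components of the image Hessian.
    """
    root = math.sqrt(1.0 + hx * hx + hy * hy)
    E, F, G = 1.0 + hx * hx, hx * hy, 1.0 + hy * hy      # first fundamental form
    e, f, g = hxx / root, hxy / root, hyy / root         # second fundamental form
    d = E * G - F * F
    return (
        ((f * F - e * G) / d, (g * F - f * G) / d),
        ((e * F - f * E) / d, (f * F - g * E) / d),
    )


def gaussian_and_mean_curvature(hx, hy, hxx, hxy, hyy):
    """Return (K, H): K is the determinant of (a_ij), H is minus half its trace."""
    (a11, a12), (a21, a22) = curvature_matrix(hx, hy, hxx, hxy, hyy)
    return a11 * a22 - a12 * a21, -0.5 * (a11 + a22)


# Upper hemisphere h = sqrt(R^2 - x^2 - y^2) with R = 2, evaluated at (0.5, 0.3).
R, x, y = 2.0, 0.5, 0.3
h = math.sqrt(R * R - x * x - y * y)
hx, hy = -x / h, -y / h
hxx = -(h * h + x * x) / h ** 3
hxy = -(x * y) / h ** 3
hyy = -(h * h + y * y) / h ** 3
K, H = gaussian_and_mean_curvature(hx, hy, hxx, hxy, hyy)
```

For this surface K equals 1/R^2 everywhere, whereas the determinant of the Hessian alone varies with position; the two agree only in sign and at the critical point x = y = 0.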




3.  DETERMINATION  OF  THE  IMAGE  HESSIAN  FROM 
PHOTOMETRIC  STEREO 


The image Hessian is determined from photometric stereo in
analogous fashion to how conventional photometric stereo
determines surface orientation. Surface orientation is determined
from the intersection of equi-reflectance curves in gradient space.
Each equi-reflectance curve arises from the measurement of image
irradiance in each successive image. It was shown in [Woodham
1980] that for a Lambertian reflectance map at least three successive
images, each using a light source at a different incident orientation,
are required to uniquely determine any surface orientation. Once the
surface orientation (p,q,-1) is known, the reflectance gradient
$(R_p, R_q)$ is also known. Substituting the image gradient value
$(I_x, I_y)$ from a single image into vector equation (1) in section 2
gives rise to the intersection of two planes in a 3-dimensional feature
space spanned by $h_{xx}$, $h_{xy}$ and $h_{yy}$. The components of the
image Hessian are therefore constrained by the equation of a line.


The minus sign on k results from the definition of the second
fundamental form $II_p$ at point p defined by
$II_p(v) = -\langle dN_p(v), v \rangle$. Hence the matrix
$(a_{ij}) + kI$ cannot be invertible and must have determinant zero.
Therefore,

$$\det \begin{pmatrix} a_{11} + k & a_{12} \\ a_{21} & a_{22} + k \end{pmatrix} = 0$$

resulting in the characteristic equation

$$k^2 + k\,(a_{11} + a_{22}) + a_{11} a_{22} - a_{21} a_{12} = 0.$$

Since the sum of the roots is equal to negative the coefficient
of the linear term and $H = \frac{1}{2}(k_1 + k_2)$, the characteristic
equation can be written


Since the image plane remains stationary for successive
images in photometric stereo the components of the image Hessian
can be further constrained by using the image gradient
$(I_x', I_y')$ from a second image knowing the reflectance gradient
$(R_p', R_q')$. The prime notation for R'(p,q) refers to the same
reflectance function R(p,q) but using the different light source
incident orientation which is used for the second image. The vector
equation

$$\begin{pmatrix} I_x' \\ I_y' \end{pmatrix} = \begin{pmatrix} \partial p/\partial x & \partial q/\partial x \\ \partial p/\partial y & \partial q/\partial y \end{pmatrix} \begin{pmatrix} R_p' \\ R_q' \end{pmatrix}$$


is  combined  with  equation  (1)  to  give  four  sets  of  three  simultaneous 
planar  equations  each  with  a  point  solution.  Ideally  all  point 
solutions  should  be  identical.  Since  empirical  measurements  have  an 
inherent  error,  the  point  solutions  are  likely  to  form  a  cluster  and  a 
statistical  method  can  be  used  to  select  the  best  point  value 
associated  with  the  cluster.  If  n>2  successive  images  are  used  only 
sets  of  three  simultaneous  equations  containing  all  three  variables 
hxx,  hxy  and  hyy  produce  point  solutions.  This  results  in  a  total  of 
C(2n,3)-2C(n,3)  point  solutions  forming  a  cluster. 
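The count above can be checked with a short sketch. It assumes that each image contributes two planar equations, one involving only hxx and hxy and one involving only hxy and hyy (an inference from the form of equation (1), not something stated explicitly here), so that a triple fails to give a point solution only when all three equations are of the same kind. The function names are hypothetical.

```python
from itertools import combinations
from math import comb


def num_point_solutions(n: int) -> int:
    """Closed-form count C(2n,3) - 2C(n,3) of triples that involve all three unknowns."""
    if n < 2:
        raise ValueError("at least two images are needed")
    return comb(2 * n, 3) - 2 * comb(n, 3)


def count_by_enumeration(n: int) -> int:
    """Brute-force check: label each equation by its kind ('x' or 'y') and
    keep only the triples that mix both kinds (assumed to give a point solution)."""
    equations = [("x", i) for i in range(n)] + [("y", i) for i in range(n)]
    mixed = 0
    for triple in combinations(equations, 3):
        kinds = {kind for kind, _ in triple}
        if len(kinds) == 2:
            mixed += 1
    return mixed
```

For n = 2 both functions give the four point solutions mentioned in the text.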


4. LINES OF CURVATURE FROM PHOTOMETRIC
STEREO


A  connected  curve  C  on  a  smooth  surface  S  is  a  line  of 
curvature  of  S  if  for  each  point  p  that  C  passes  through  the  tangent 
line  is  a  principal  direction  of  curvature  at  p  on  S.  Lines  of  curvature 
accentuate  the  shape  of  a  smooth  object  and  provide  a  good  contour 
representation.  It  will  be  shown  here  how  a  line  of  curvature  can  be 
traced  using  the  curvature  matrix.  Since  it  was  shown  in  sections  2 
and  3  how  the  curvature  matrix  could  be  directly  recovered  from 
photometric  stereo,  the  lines  of  curvature  can  be  derived  from  this 
method. 


$$ k^2 - 2Hk + K = 0 $$


with roots


$$ k = H \pm \sqrt{H^2 - K}. $$


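The eigenvectors l^(i), i = 1,2, solve the homogeneous system


$$ \begin{pmatrix} a_{11}+k_i & a_{12} \\ a_{21} & a_{22}+k_i \end{pmatrix} \begin{pmatrix} l_1^{(i)} \\ l_2^{(i)} \end{pmatrix} = 0. $$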


Hence l^(i) can be any scalar multiple of the vector


$$ \left( -\frac{a_{12}}{a_{11}+k_i},\; 1 \right). $$


The  principal  directions  of  curvature  in  3-dimensions  are  given 
by 


using equations (3) in section 2. To generate a line of curvature
starting at a point p on S, select a principal direction of curvature. The
projection of this direction onto the image plane is (l1^(i), l2^(i)).
Proceed for a distance Δs in the image plane along this vector until
another sampled image point p' is reached. If the image coordinates
for point p are (xp, yp), the image coordinates for the new image
point p' will be


To obtain the two orthogonal eigenvectors of the matrix (a_ij),
the characteristic roots must first be derived. As mentioned in section
2 these are the principal curvatures k1 and k2. Letting k represent any
one of the principal curvatures and I be the identity matrix, the vector
v in the tangent plane of a point p on S is an eigenvector if

$$ dN(v) = -kv = -kIv . $$


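$$ \left( x_p + \frac{\Delta s\, l_1^{(i)}}{\sqrt{(l_1^{(i)})^2 + (l_2^{(i)})^2}},\;\; y_p + \frac{\Delta s\, l_2^{(i)}}{\sqrt{(l_1^{(i)})^2 + (l_2^{(i)})^2}} \right). $$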

At p' select the principal direction of curvature which is closest to
the principal direction of curvature from the previous sampled point
and continue the above process from point to point. Since the
principal directions of curvature are orthogonal there should be no
problem selecting the proper directions at each point for a reasonably
well sampled image unless a line of curvature has a very severe
"bend". In this way a map of tangent vectors can be generated for a
line of curvature using equation (5). The tangent vectors should be
normalized according to


$$ \sqrt{(l_1^{(i)})^2 + (l_2^{(i)})^2} = \Delta s $$


for  each  As  between  sample  points  in  the  image.  It  is  then 
straightforward  to  construct  the  actual  line  of  curvature  in  3- 
dimensions  up  to  vertical  translation  from  the  image  plane. 
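The tracing procedure of this section can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the 2x2 curvature matrix (a_ij) has already been recovered at the current sample point, that its eigenvector components can be used directly as image-plane directions as the text describes, and that an umbilic point (no preferred direction) simply leaves the position unchanged. All function names are hypothetical.

```python
import math


def principal_curvatures(a):
    """Roots of k^2 - 2Hk + K = 0 for a = ((a11, a12), (a21, a22))."""
    (a11, a12), (a21, a22) = a
    H = -0.5 * (a11 + a22)       # mean curvature, since k1 + k2 = -(a11 + a22)
    K = a11 * a22 - a12 * a21    # Gaussian curvature, k1 * k2
    root = math.sqrt(max(H * H - K, 0.0))  # clamp small negatives caused by noise
    return H + root, H - root


def principal_direction(a, k):
    """Unit vector l with (a + kI) l = 0, or None if a + kI vanishes."""
    (a11, a12), (a21, a22) = a
    rows = [(a11 + k, a12), (a21, a22 + k)]
    row = max(rows, key=lambda r: math.hypot(r[0], r[1]))  # better-conditioned row
    norm = math.hypot(row[0], row[1])
    if norm < 1e-12:
        return None
    return (-row[1] / norm, row[0] / norm)  # perpendicular to the chosen row


def trace_step(p, prev_dir, a, ds):
    """Move a distance ds from image point p along the principal direction
    (of either sign) that is closest to the previous direction prev_dir."""
    candidates = []
    for k in principal_curvatures(a):
        l = principal_direction(a, k)
        if l is not None:
            candidates.append(l)
            candidates.append((-l[0], -l[1]))
    if not candidates:
        return p, prev_dir
    best = max(candidates, key=lambda l: l[0] * prev_dir[0] + l[1] * prev_dir[1])
    return (p[0] + ds * best[0], p[1] + ds * best[1]), best
```

Choosing the candidate with the largest dot product against the previous direction is one simple way to realize "closest to the principal direction of curvature from the previous sampled point".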

5.  CONCLUSION 


It  has  been  demonstrated  that  the  curvature  matrix  for  a 
smooth surface can be directly derived from a photometric stereo
method  that  utilizes  the  measurement  of  image  irradiance  and  the 
image  gradient.  The  key  difference  between  this  method  and 
previous  methods  of  photometric  stereo  is  the  use  of  image  gradient 
values  between  successive  images  to  completely  recover  the  image 
Hessian  matrix  for  an  arbitrary  smooth  surface.  Previous  derivations 
of  the  image  Hessian  matrix  required  auxiliary  assumptions  about  the 
smooth  surface. 


The  relationship  between  the  image  Hessian  matrix  and  the 
standard  curvature  matrix  was  clarified.  The  determinant  of  the 
image  Hessian  provides  an  accurate  measure  of  the  sign  of  the 
Gaussian  curvature.  Therefore  the  photometric  stereo  method 
presented  here  deriving  the  image  Hessian  is  a  very  efficient  way  of 
determining  local  convexity,  concavity  or  flatness  without  having  to 
refer  to  the  normal  map.  It  was  shown  that  the  image  Hessian  could 
be  used  together  with  knowledge  of  local  surface  orientation  to 
determine  the  curvature  matrix  for  an  arbitrary  smooth  surface. 


With  the  curvature  matrix  derived  it  was  shown  how  to 
generate  values  for  the  Gaussian  and  mean  curvatures.  A  method  was 
given  for  determining  lines  of  curvature  which  provide  a  good 
contour  representation  for  a  smooth  surface. 


REFERENCES 

[Do Carmo 1976] Do Carmo, M.P.
Differential Geometry of Curves and Surfaces
Prentice-Hall, 1976

[Woodham 1980] Woodham, R.J.
Photometric Method for Determining Surface Orientation from Multiple Images
Optical Engineering, vol. 19, no. 1, pp. 139-144, Jan-Feb 1980

[Woodham 1981] Woodham, R.J.
Analysing Images of Curved Surfaces
Artificial Intelligence, vol. 17, pp. 117-140, August 1981


This research was supported in part by ARPA grant #N00039-84-C-0165.




QUALITATIVE  INFORMATION 
IN  THE  OPTICAL  FLOW 

Alessandro  Verri  and  Tomaso  Poggio 


Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 


ABSTRACT 

In this paper we show that the optical flow, a 2-D field
that can be associated with the variation of the image
brightness pattern, and the 2-D motion field, the projection
on the image plane of the 3-D velocity field of
a moving scene, are in general quantitatively different,
unless very special conditions are satisfied. The optical
flow, therefore, is ill-suited for computing structure from
motion and for reconstructing the 3-D velocity field,
problems that require an accurate estimate of the
2-D motion field. As a consequence, we argue that
the optical flow should be used to yield information of
a more qualitative type, such as motion discontinuities.
We also show how the (smoothed) optical flow and 2-D
motion field, interpreted as vector fields tangent to
flows of planar dynamical systems, have the same qualitative
properties from the point of view of the theory of
structural stability of dynamical systems.


1.  INTRODUCTION 

A key task for many vision systems is to extract information
from a sequence of images. This information can
be useful to solve important problems such as recovering
the 3-D velocity field, or segmenting the image into
parts corresponding to different moving objects, or reconstructing
the 3-D structure of surfaces. The recovery
of the 2-D motion field (which we define as the projection
on the image plane of the 3-D velocity field) is thought
to be an essential step in the solution of these problems.
The data available, however, are temporal variations
in the brightness pattern. These variations are
usually associated with a perceived motion field, called
optical flow (Gibson, 1950; Fennema and Thompson,
1979; Horn and Schunck, 1981). In order to recover
the 2-D motion field, the assumption that the 2-D motion
field and the optical flow coincide has often been
made. It must be noted, though, that this assumption is
clearly satisfied only in the case in which variations in
the brightness pattern correspond to markings on the
visible 3-D surfaces. In fact several authors have developed
algorithms to reconstruct the 2-D motion field
from optical flow data defined only at locations of features
in the image (Hildreth, 1984a,b; Waxman, 1986).
Examples in which this assumption does not hold are
known (Horn, 1986), but they have been regarded as
pathological cases. As a matter of fact, algorithms that
deal with the recovery of the 2-D motion field from dense
optical flow data have been proposed, with the more or
less explicit assumption that the two fields are the same
(Horn and Schunck, 1981; Nagel, 1984; Kanatani, 1985).

In this paper we show that the optical flow and the
motion field are in general different, unless very special
conditions are satisfied. We explicitly compute the
difference between their normal components (the component
along the direction of the gradient) under broad
assumptions. We show that they are arbitrarily close
where the image gradient is sufficiently strong. Hence,
feature-based matching algorithms that rely on edges of
various types (including texture edges) are more appropriate
than point-to-point ones to solve problems that
rely on accurate recovery of the 2-D motion field, such
as structure from motion. One may then ask, what is
the optical flow for? In the second part of the paper we
suggest that meaningful information about the 3-D velocity
field and the 3-D structure can be obtained from
qualitative properties of the 2-D motion field, such as
its discontinuities. We then argue that this information
can be retrieved directly from the optical flow or its normal
components. As an example, we describe a specific
approach that exploits results from the theory of stability
of dynamical systems. A more detailed analysis of
this approach will be presented in a forthcoming paper
by V. Torre and coworkers.

The paper is divided into two parts. In the first,
we define the problem and state explicitly the assumptions
that we have used. In particular, we consider
in detail how image irradiance can be related to
scene radiance in the case of a scene consisting of non-lambertian
surfaces. We then describe a method that
allows us to show that the optical flow and the motion
field are almost always different. We compute the difference
between the normal components of the two fields
assuming, first, the lambertian model of reflectance and
then a more realistic one for arbitrary rigid motion of
a generic surface. We also calculate how this difference
depends on the image gradient and the 3-D velocity of
moving objects. In the second part we show how both
the optical flow and the motion field can be processed
to become vector fields tangent to flows of dynamical
systems. The optical flow, then, can be considered as a
perturbed motion field under the conditions determined
in the first part. Results from the theory of stability of
dynamical systems suggest that qualitative, stable properties
of the motion field hold for the optical flow. We
sketch some examples of these properties and how they
can be used in a description of the 3-D velocity field. We
finally discuss briefly some connections with biological
systems.


2.  PRELIMINARIES 

In this chapter we review the definitions of motion field
and optical flow, and we state the assumptions that we
used throughout the paper. In particular, we consider in
detail how image irradiance can be related to scene radiance
in the case of a scene consisting of non-lambertian
surfaces.


2.1.  Definitions 

Let us define notations and summarize definitions that
will be useful in what follows. Throughout the following
we will assume, unless otherwise stated, that
any expression can be differentiated as many times as
needed. Let

$$ \mathbf{x}_p = \frac{f}{f + \mathbf{x}\cdot\mathbf{n}} \bigl( \mathbf{x} - (\mathbf{x}\cdot\mathbf{n})\,\mathbf{n} \bigr) \tag{2.1.1} $$

be the equation defining the projection of a generic point
on the image plane, where $\mathbf{x}_p = (x_p, y_p, 0)$ is the position
vector of the projected point, $\mathbf{x} = (x, y, z)$ is the
position vector of the point, $\mathbf{n}$ is the unit vector normal
to the image plane (projection plane) and $f$ is the focal
length (see Figure 1). Notice that the origin O is on
the image plane, the focus of projection F is located at
$(0, 0, -f)$, and $f\mathbf{n} + \mathbf{x}$ is the vector pointing from F to
the point.


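Figure 1. The geometry of perspective projection.


The motion field $\mathbf{v}_p$ can be obtained by differentiating
(2.1.1) with respect to time. If $\mathbf{v} = d\mathbf{x}/dt$ we have (see footnote 1)

$$ \mathbf{v}_p = \frac{f}{f + \mathbf{x}\cdot\mathbf{n}} \Bigl( \mathbf{v} - (\mathbf{v}\cdot\mathbf{n})\,\mathbf{n} - \frac{\mathbf{v}\cdot\mathbf{n}}{f + \mathbf{x}\cdot\mathbf{n}} \bigl( \mathbf{x} - (\mathbf{x}\cdot\mathbf{n})\,\mathbf{n} \bigr) \Bigr) \tag{2.1.2} $$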

Notice  that  in  (2.1.2)  vp  is  given  in  terms  of  x  and  v, 
position  and  velocity  of  the  moving  points  in  the  scene, 
which  are  not  known. 
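The projection (2.1.1) and the motion field (2.1.2) can be evaluated numerically when x and v are known, for example in a synthetic scene. The sketch below is only an illustration under that assumption; the helper names `project` and `motion_field` are hypothetical, and vectors are plain 3-tuples.

```python
def dot(u, v):
    """Euclidean inner product of two 3-tuples."""
    return sum(a * b for a, b in zip(u, v))


def project(x, n, f):
    """Perspective projection (2.1.1): x_p = f / (f + x.n) * (x - (x.n) n)."""
    xn = dot(x, n)
    scale = f / (f + xn)
    return tuple(scale * (xi - xn * ni) for xi, ni in zip(x, n))


def motion_field(x, v, n, f):
    """Motion field (2.1.2), the time derivative of the projected point."""
    xn = dot(x, n)
    vn = dot(v, n)
    depth = f + xn  # distance of the point from the focus along n
    return tuple(
        (f / depth) * (vi - vn * ni - (vn / depth) * (xi - xn * ni))
        for xi, vi, ni in zip(x, v, n)
    )
```

A finite-difference quotient of `project` along v should agree with `motion_field` to first order, which gives a simple consistency check of (2.1.2) against (2.1.1).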


Let $E = E(x_p, y_p, t)$ be the image irradiance, that
is the intensity of light at the point $(x_p, y_p)$ of the image
plane at the time $t$. If $\nabla_p$ is the gradient with respect
to the image coordinates, then

$$ \frac{dE}{dt} = \frac{\partial E}{\partial t} + \nabla_p E \cdot \mathbf{v}_p \tag{2.1.3} $$

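Now if

$$ \frac{dE}{dt} = 0 \tag{2.1.4} $$

then, if $\|\nabla_p E\| \neq 0$,

$$ -\frac{\partial E/\partial t}{\|\nabla_p E\|} = \frac{\nabla_p E \cdot \mathbf{v}_p}{\|\nabla_p E\|} \tag{2.1.5} $$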

Therefore, if (2.1.4) holds, the projection of the motion
field along the direction of the gradient can be given in
terms of derivatives of the image irradiance (which can
be computed). In what follows, this component will be
called $v_\perp$, or the normal component; thus

$$ v_\perp = \frac{\mathbf{v}_p \cdot \nabla_p E}{\|\nabla_p E\|}\, \frac{\nabla_p E}{\|\nabla_p E\|} \tag{2.1.6} $$


1 It can be easily shown that the perspective projection of
the 3-D velocity vector is equal to the velocity of the projected
point on the image plane, since both vectors are
defined in terms of infinitesimals. This is not true for a
generic, finite vector.




Equation (2.1.5) can be interpreted as an instance
of the well-known aperture problem (Marr and
Ullman, 1981; Horn and Schunck, 1981): the information
available at each point of a sequence of frames is
only the component of the motion field along the direction
of the image gradient. To recover a full and unique
motion field, some other constraint is needed: Horn and
Schunck (1981), for example, showed that there is only
one 2-D field whose normal component coincides with
(2.1.6) and which is the smoothest of all possible ones.
Examples for which (2.1.5) is not true are well known
(Horn and Schunck, 1981). Consider, for instance, a
rotating sphere with no texture on it (i.e. with uniform
albedo) under arbitrary, fixed illumination. Since
the image irradiance at each image location does not
change with time, the left-hand side of (2.1.5) is identically
equal to zero, while the right-hand side is different
from zero almost everywhere. Notice that if the sphere is
kept fixed and the light source is moved, (2.1.5) is again
wrong. In this case, however, the left-hand side is different
from zero while $v_\perp$ is zero everywhere. In both cases
the perceived motion in the image is different from the
motion field. It is worthwhile, then, to introduce a new
field, called the minimal optical flow, related to the perceived
motion in the image, and not necessarily equal
to the normal component of the motion field. Notice
that the perceived motion in the examples above agrees
qualitatively with the left-hand side of (2.1.5). Indeed,
the optical flow in the first case is identically equal to
zero, while in the second it is different from zero almost
everywhere. Therefore let us define the normal component
$o_F$ of the optical flow as:


$$ o_F = -\frac{\partial E/\partial t}{\|\nabla_p E\|}\, \frac{\nabla_p E}{\|\nabla_p E\|} \tag{2.1.7} $$


Hence, with respect to this definition, the minimal optical
flow and the normal component of the motion field
are always directed along the gradient and they coincide
if and only if (2.1.4) holds.


Remark: in the literature, it is usually assumed that
(2.1.4) holds. When this is true, the normal components
of the motion field and the minimal optical flow
are the same and the latter can be used as a constraint
to  recover  the  2-D  motion  field. 
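The definitions (2.1.6) and (2.1.7) can be illustrated with a small sketch. It assumes the partial derivatives of the image irradiance at a pixel (here called `Ex`, `Ey`, `Et`, hypothetical names) have already been estimated, and it represents both fields as 2-D vectors along the gradient.

```python
def minimal_optical_flow(Ex, Ey, Et):
    """Normal component o_F of the optical flow, definition (2.1.7):
    -(dE/dt) / |grad E| along the unit gradient direction."""
    g2 = Ex * Ex + Ey * Ey
    if g2 == 0.0:
        raise ValueError("the image gradient must be non-zero")
    s = -Et / g2
    return (s * Ex, s * Ey)


def normal_motion_component(vp, Ex, Ey):
    """Normal component v_perp of a motion-field vector vp, definition (2.1.6)."""
    g2 = Ex * Ex + Ey * Ey
    if g2 == 0.0:
        raise ValueError("the image gradient must be non-zero")
    s = (vp[0] * Ex + vp[1] * Ey) / g2
    return (s * Ex, s * Ey)
```

As an example under assumed data, a brightness pattern E(x, y, t) = 3(x - 2t) + 4(y - t) translates rigidly in the image with v_p = (2, 1); its derivatives are Ex = 3, Ey = 4, Et = -10, condition (2.1.4) holds, and the two functions return the same vector. Setting Et = 0 while keeping v_p non-zero mimics the textureless rotating sphere, where the two fields differ.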


2.2.  Scene  Radiance  and  Image  Irradiance 

Let us review briefly some definitions of photometry and
make explicit the constraints under which the image irradiance
is related to the scene radiance. The image
irradiance E is the power per unit area of light at each
point $(x_p, y_p)$ of the image plane: thus $E = E(x_p, y_p)$.
The scene radiance L is the power per unit area of light
that can be thought of as emitted by each point of a surface
S in the scene in a particular direction. This surface
can be fictitious, or it may be the actual radiating surface
of a light source, or the illuminated surface of a
solid. The scene radiance can be thought of as a function
of the point of the surface and of the direction in
space. If $(a, b)$ are intrinsic coordinates of the surface
and $(\alpha, \beta)$ polar coordinates determining a direction in
space with respect to the normal to the surface, we can
write $L = L(a, b, \alpha, \beta)$. Given the scene radiance it is
possible, in principle, to compute the expected image
irradiance.

Assuming that the surface is lambertian, i.e.
$L(a, b, \alpha, \beta) = L(a, b)$, that there are no losses within
the system and that the angular aperture (on the image
side) is small, it can be proved (Born and Wolf, 1959)
that

$$ E(x_p(a,b),\, y_p(a,b)) = L(a,b)\,\Omega \cos^4\varphi \tag{2.2.1} $$

where $\Omega$ is the solid angle corresponding to the angular
aperture and $\varphi$ is the angle between the principal
ray (that is the ray passing through the center of the
aperture) and the optical axis. With the further assumption
that the aperture is much smaller than the
distance of the viewed surface, the lambertian hypothesis
can be relaxed to give (Horn and Sjoberg, 1979)

$$ E(x_p(a,b),\, y_p(a,b)) = L(a, b, \alpha^0, \beta^0)\,\Omega \cos^4\varphi \tag{2.2.2} $$

where $\alpha^0$ and $\beta^0$ are the polar coordinates of the
direction of the principal ray. It must be pointed out
that (2.2.2) holds if L is continuous with respect to $\alpha$
and $\beta$. In what follows we will assume that this is the
case. Furthermore, we will assume that the optical system
has been calibrated. Finally, notice that

$$ \nabla_p E \cdot \mathbf{v}_p = \nabla_s L \cdot \Bigl( \frac{da}{dt}, \frac{db}{dt} \Bigr), \tag{2.2.3} $$

where $\nabla_s$ is the gradient with respect to the surface
coordinates, since differentiating (2.2.1) we have

$$ \nabla_p E \cdot (dx_p, dy_p) = \nabla_s L \cdot (da, db). \tag{2.2.4} $$

3.  MINIMAL  OPTICAL  FLOW  AND 
MOTION  FIELD 


We describe a general method that allows us to show
that the minimal optical flow and the normal component
of the motion field are almost always different, or
equivalently that (2.1.4) does not hold. We compute
the difference between the normal components of the
two fields, assuming first the lambertian model of reflectance
and then a more realistic one for pure translation,
pure rotation and general rigid motion of a generic
surface. It turns out that the two fields are equal only
under very special conditions, which can be explicitly
stated. We also show that the difference is smaller
where the image gradient is stronger, justifying the use
of feature-based algorithms. Of course, this argument
does not imply that feature-based algorithms should be
used: it says, however, that locations of edges (meant
here as sharp changes in intensity) contain most of the
correct information.

Figure 2. Scene radiance and image irradiance in the pinhole
approximation: the image irradiance at the point $(x_p, y_p)$ is
given by the scene radiance at the point $(a, b)$ on the surface
in the direction of the line connecting the two points and
passing through the pinhole $P_h$.

3.1.  Computing  the  Minimal  Optical  Flow 

Consider a rigid surface S moving in space. From (2.2.1),
the image irradiance E at the time $t$ at the point
$(x_p, y_p)$ is equal to the scene radiance L at the point
$(a, b)$ on S, i.e. $E(x_p, y_p, t) = L(a, b)$. The image irradiance
at the time $t + \Delta t$ is given by the scene radiance
of the surface at the time $t + \Delta t$. As shown in Figure 3,
the point on S that radiates toward $(x_p, y_p)$ at the time
$t + \Delta t$ is the point $(a - \Delta a, b - \Delta b)$ (see footnote 2). The normal N
to S at the time $t + \Delta t$ at the point $(a - \Delta a, b - \Delta b)$,
$\mathbf{N}_{t+\Delta t}(a - \Delta a, b - \Delta b)$, will be

$$ \mathbf{N}_{t+\Delta t}(a - \Delta a, b - \Delta b) = \mathbf{N}_t(a - \Delta a, b - \Delta b) + \Delta\mathbf{N} \tag{3.1.1} $$

where $\Delta\mathbf{N}$ is the first order variation of N due to the
motion of S during the time interval $\Delta t$. Now in the
case of translation

$$ \Delta\mathbf{N} = 0 \tag{3.1.2} $$

while in case of rotation with angular velocity $\boldsymbol{\omega}$

$$ \Delta\mathbf{N} = \boldsymbol{\omega} \times \mathbf{N}\,\Delta t \tag{3.1.3} $$

Notice that (3.1.3) can be considered as the expression
of $\Delta\mathbf{N}$ for any kind of motion. Similarly, for each
argument A of the scene radiance, we can write

$$ A_{t+\Delta t}(a, b) = A_t(a, b) + \Delta A. \tag{3.1.4} $$

To compute $\Delta A$, let us distinguish between arguments
of L that are intrinsic functions of the surface coordinates
$(a, b)$, such as texture and albedo, and those that
are in fact functions of the space coordinates $(x, y, z)$
(such as the illumination and the point of view) and that
are expressed in terms of $(a, b)$ only for convenience. If
A is an intrinsic function of the surface coordinates, it
follows immediately that

$$ \Delta A = 0, \tag{3.1.5} $$

while if A is a function of the space coordinates, from
the Taylor expansion we have

$$ \Delta A = \nabla A \cdot \mathbf{v}\,\Delta t, \tag{3.1.6} $$

where $\nabla$ is the gradient operator with respect to the
space coordinates. Let us assume that L can be written
as a function of m arguments $A^i$, $i = 1, \ldots, m$ and of N.
Then, taking into account (3.1.3) and (3.1.4), (2.2.1)
becomes

$$ E(x_p, y_p, t + \Delta t) = L\bigl( A^i_t(a - \Delta a, b - \Delta b) + \Delta A^i,\; \mathbf{N}_t(a - \Delta a, b - \Delta b) + \Delta\mathbf{N} \bigr) \tag{3.1.7} $$

at time $t + \Delta t$ and

$$ E(x_p, y_p, t) = L\bigl( A^i_t(a, b),\; \mathbf{N}_t(a, b) \bigr) \tag{3.1.8} $$

at time $t$. Therefore, using (3.1.7) and (3.1.8),

$$ \frac{\partial E}{\partial t} = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \Bigl( L\bigl( A^i_t(a - \Delta a, b - \Delta b) + \Delta A^i,\; \mathbf{N}_t(a - \Delta a, b - \Delta b) + \Delta\mathbf{N} \bigr) - L\bigl( A^i_t(a, b),\; \mathbf{N}_t(a, b) \bigr) \Bigr), \tag{3.1.9} $$

where the $\Delta A^i$ are computed using (3.1.5) or (3.1.6)
according to the kind of argument. From (3.1.9), the
minimal optical flow can be derived easily. To simplify
notation, let us suppress the subscript t from Equation
(3.1.9).

Figure 3. Computing the minimal optical flow: the point
$(a, b)$ on S radiates toward $(x_p, y_p)$ at time $t$. The point
$(a - \Delta a, b - \Delta b)$ radiates toward the same point at time
$t + \Delta t$. The normal $\mathbf{N}_t$ is the normal to S at the point
$(a, b)$ and at $(a - \Delta a, b - \Delta b)$.

From (3.1.9) we easily get

$$ \frac{\partial E}{\partial t} = -\nabla_s L \cdot \Bigl( \frac{da}{dt}, \frac{db}{dt} \Bigr) + \sum_{i=1}^{p} \frac{\partial L}{\partial A^i}\, \nabla A^i \cdot \mathbf{v} + \frac{\partial L}{\partial \mathbf{N}} \cdot \boldsymbol{\omega} \times \mathbf{N} $$

if p of the $A^i$ ($i = 1, \ldots, m$) require the use of (3.1.6)
to compute $\Delta A^i$ and $\partial L / \partial \mathbf{N} = (\partial L/\partial N_x, \partial L/\partial N_y, \partial L/\partial N_z)$ for $\mathbf{N} = (N_x, N_y, N_z)$.
Therefore, using (2.1.6), (2.1.7), and (2.2.4), we can
write

$$ o_F - v_\perp = -\frac{1}{\|\nabla_p E\|} \Bigl( \sum_{i=1}^{p} \frac{\partial L}{\partial A^i}\, \nabla A^i \cdot \mathbf{v} + \frac{\partial L}{\partial \mathbf{N}} \cdot \boldsymbol{\omega} \times \mathbf{N} \Bigr) \tag{3.1.10} $$

Thus, the normal components of the two fields are different
if the surface undergoes a motion with a rotational
component, or the reflectance function contains
arguments depending on space coordinates.

Let us consider now some interesting examples in
detail.

2 We assume that the surface corresponds to a moving convex
body to avoid self-occlusions due to the motion. In
fact, the computation that follows holds for any convex
surface patch.

3.2. Translation of a Lambertian Surface

Consider a lambertian surface S. The scene radiance
due to S will be

$$ L = \rho\, \mathbf{I} \cdot \mathbf{N} \tag{3.2.1} $$

where $\rho$ is the albedo of S, $\mathbf{I}$ the unit vector in the direction
of the illumination and $\mathbf{N}$ is the unit normal to
the surface. Let us compute the difference (3.1.10) between
the normal components of the optical flow and of
the motion field corresponding to a translation of S in
space with velocity $\mathbf{v}$ under uniform fixed illumination.
Substituting (3.2.1) in (3.1.10) and changing the sign,
we have

$$ v_\perp = o_F, \tag{3.2.2} $$

since $\boldsymbol{\omega} = 0$ and none of the arguments of L in (3.2.1) depends
on space coordinates ($\mathbf{I}$ is constant). Therefore,
the minimal optical flow of a translating lambertian surface
uniformly illuminated is exactly equal to the motion
field.


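Remark: in the case of non-uniform illumination the
right hand side of (3.2.2) contains an extra term due to
$\Delta\mathbf{I}$. Using (3.1.6) to compute the components of $\Delta\mathbf{I}$,
(3.1.10) yields

$$ v_\perp - o_F = \frac{1}{\|\nabla_p E\|}\, \rho \Bigl( \frac{\partial \mathbf{I}}{\partial x}\frac{dx}{dt} + \frac{\partial \mathbf{I}}{\partial y}\frac{dy}{dt} + \frac{\partial \mathbf{I}}{\partial z}\frac{dz}{dt} \Bigr) \cdot \mathbf{N} $$

which can be rewritten

$$ v_\perp - o_F = \frac{1}{\|\nabla_p E\|}\, \rho\, \frac{d\mathbf{I}}{dt} \cdot \mathbf{N} \tag{3.2.3} $$

since $\partial \mathbf{I}/\partial t = 0$ (the illumination is supposed to be
fixed).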


3.3.  Rotation  of  a  Lambertian  Surface 

Let S be a lambertian surface rotating in space with
angular velocity $\boldsymbol{\omega}$. Let $\mathbf{I}$ be again uniform. Applying
the same argument of the previous section but taking
into account the constraint (3.1.3) for $\Delta\mathbf{N}$, we get

$$ v_\perp - o_F = \frac{\rho\, \mathbf{N} \cdot \mathbf{I} \times \boldsymbol{\omega}}{\|\nabla_p E\|} \tag{3.3.1} $$


In the case of rotation, therefore, even under uniform
illumination, the minimal optical flow and the normal
component of the motion field are different. They
are equal for any surface only if $\boldsymbol{\omega}$ and $\mathbf{I}$ are parallel.
This corresponds to the case of a surface rotating around
an axis parallel to the direction of uniform illumination.
In the case of non-uniform illumination, an extra term
like  the  one  in  (3.2.3)  must  be  added  to  (3.3.1). 
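The behaviour of (3.3.1) can be explored with a short numerical sketch. It reads (3.3.1) as the scalar ρ N · (I × ω) / ||∇E|| and treats the gradient magnitude as a given input; the function name and the sample values are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two 3-tuples."""
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    """Vector product of two 3-tuples."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def lambertian_rotation_gap(rho, N, I, omega, grad_norm):
    """Difference v_perp - o_F of (3.3.1) for a rotating lambertian surface
    under uniform illumination I; grad_norm is the image gradient magnitude."""
    if grad_norm <= 0.0:
        raise ValueError("the image gradient magnitude must be positive")
    return rho * dot(N, cross(I, omega)) / grad_norm
```

The gap vanishes when the rotation axis is parallel to the illumination direction, and for fixed ρ, N, I and ω it shrinks in inverse proportion to the gradient magnitude.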


3.4.  Translation  of  a  Specular  Surface 


Let us consider now a model of reflectance more realistic
than the lambertian one. Following Phong (1975;
see also Horn and Sjoberg, 1979) we define the scene
radiance as a linear combination of a lambertian and a
specular term, i.e.

$$ L = L_{lamb} + L_{spec}. \tag{3.4.1} $$

The lambertian term is equal to the one used before,
while the specular term is

$$ L_{spec} = s\, \frac{\mathbf{D} \cdot \mathbf{R}}{D} \tag{3.4.2} $$

where $s$ is the fraction of light reflected by the surface,
$\mathbf{D} = f\mathbf{n} + \mathbf{x}$ is the vector pointing from the focus to the
radiating point, $D = \|\mathbf{D}\|$, and

$$ \mathbf{R} = \mathbf{I} - 2(\mathbf{I} \cdot \mathbf{N})\,\mathbf{N} \tag{3.4.3} $$

is the unit vector in the direction of the perfect specular
reflection. Let us assume that $s$ is not a function of the
direction of the incident light and that it is constant on
the surface. The specular term is thus proportional to
the cosine of the angle between the direction of specular
reflection and the line of sight.


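Since we are computing derivatives and L is a linear
combination of $L_{lamb}$ and $L_{spec}$ we can compute
separately the contributions to the minimal optical flow
due to the lambertian and the specular term, adding
the results afterward. Therefore, we only need to compute
now the specular one. Let us consider, first, the
case of pure translation of a surface S radiating according
to (3.4.2) and let us call S a specular surface. If S
is translating with velocity $\mathbf{v}$ and $\mathbf{I}$ is uniform, substituting
(3.4.2) into (3.1.10) and taking into account the
constraint (3.1.2), we have

$$ v_\perp - o_F = \frac{s}{D^3}\, \frac{D^2\, \mathbf{v} \cdot \mathbf{R} - (\mathbf{D} \cdot \mathbf{v})(\mathbf{D} \cdot \mathbf{R})}{\|\nabla_p E\|} \tag{3.4.4} $$

since from (3.1.6)

$$ \lim_{\Delta t \to 0} \frac{\Delta \mathbf{D}}{\Delta t} = \frac{\partial \mathbf{D}}{\partial x}\frac{dx}{dt} + \frac{\partial \mathbf{D}}{\partial y}\frac{dy}{dt} + \frac{\partial \mathbf{D}}{\partial z}\frac{dz}{dt} = \frac{d\mathbf{D}}{dt} = \frac{d\mathbf{x}}{dt} = \mathbf{v}. \tag{3.4.5} $$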


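Using a well known vector identity, the difference between
the two fields can again be written as

$$ v_\perp - o_F = \frac{s}{D^3}\, \frac{(\mathbf{v} \times \mathbf{D}) \cdot (\mathbf{R} \times \mathbf{D})}{\|\nabla_p E\|} \tag{3.4.6} $$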


Thus,  in  the  case  of  translation  of  a  specular  surface, 
the  minimal  optical  flow  and  the  normal  component  of 
the  motion  field  are  always  different. 
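The equivalence of the two forms (3.4.4) and (3.4.6) rests on the identity (v × D) · (R × D) = D² (v · R) - (D · v)(D · R), which can be checked numerically. The sketch below assumes the specular term is s (D · R)/||D||, as the surrounding text describes it, so that the prefactor is s/||D||³; that prefactor and the function names are assumptions of this illustration.

```python
def dot(u, v):
    """Euclidean inner product of two 3-tuples."""
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    """Vector product of two 3-tuples."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def specular_gap_dot_form(s, v, D, R, grad_norm):
    """v_perp - o_F for a translating specular surface, written as in (3.4.4)."""
    d2 = dot(D, D)
    return s * (d2 * dot(v, R) - dot(D, v) * dot(D, R)) / (d2 ** 1.5 * grad_norm)


def specular_gap_cross_form(s, v, D, R, grad_norm):
    """The same quantity written with vector products, as in (3.4.6)."""
    d2 = dot(D, D)
    return s * dot(cross(v, D), cross(R, D)) / (d2 ** 1.5 * grad_norm)
```

In this form it is also easy to see that the expression vanishes for particular geometries, for instance when v is parallel to D (translation along the line of sight) or when R is parallel to D.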


3.5.  Rotation  of  a  Specular  Surface 

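Consider now the same specular surface S rotating in
space with angular velocity $\boldsymbol{\omega}$. Then, substituting
(3.4.2) into (3.1.10) and taking into account the constraint
(3.1.3), we have, after some algebra,

$$ v_\perp - o_F = \frac{s}{D^3 \|\nabla_p E\|} \Bigl( D^2 (\mathbf{I} \times \boldsymbol{\omega}) \cdot \bigl( \mathbf{D} - 2(\mathbf{D} \cdot \mathbf{N})\mathbf{N} \bigr) - f (\mathbf{n} \times \boldsymbol{\omega}) \cdot \bigl( \mathbf{D} \times (\mathbf{D} \times \mathbf{R}) \bigr) \Bigr) \tag{3.5.1} $$

The minimal optical flow, therefore, is equal to the
motion field for any specular surface only when $\mathbf{I}$, $\boldsymbol{\omega}$ and
$\mathbf{n}$ are parallel.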

3.6.  General  Case 

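Let us consider, now, the general case. We will assume
(3.4.1) as scene radiance of a surface S undergoing a
given rigid motion (composition of a rotation and a
translation) in space. Adding together (3.3.1), (3.4.6)
and (3.5.1), we obtain the difference between the motion
field and the minimal optical flow for a surface in
the general case under uniform illumination, i.e.:

$$ v_\perp - o_F = \frac{\rho\, \mathbf{N} \cdot \mathbf{I} \times \boldsymbol{\omega}}{\|\nabla_p E\|} + \frac{s}{D^3}\, \frac{(\mathbf{v} \times \mathbf{D}) \cdot (\mathbf{R} \times \mathbf{D})}{\|\nabla_p E\|} + \frac{s}{D^3 \|\nabla_p E\|} \Bigl( D^2 (\mathbf{I} \times \boldsymbol{\omega}) \cdot \bigl( \mathbf{D} - 2(\mathbf{D} \cdot \mathbf{N})\mathbf{N} \bigr) - f (\mathbf{n} \times \boldsymbol{\omega}) \cdot \bigl( \mathbf{D} \times (\mathbf{D} \times \mathbf{R}) \bigr) \Bigr) \tag{3.6.1} $$

The right-hand side of (3.6.1) is generally different
from zero. In fact, there are no general conditions under
which it is identically equal to zero. Notice, however,
that if $\boldsymbol{\omega}$ and $\mathbf{v}$ are bounded

$$ \lim_{\|\nabla_p E\| \to \infty} |v_\perp - o_F| = 0. \tag{3.6.2} $$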

Equation (3.6.2) shows that the points in the image
where the gradient is stronger are the points where
the minimal optical flow is closer to the motion field.
These points are characterized by sharp changes in intensity
- edges - that usually correspond to important
physical events on surfaces, such as boundaries and orientation
discontinuities. Thus, to solve problems such
as structure from motion, or the recovery of the 3-D
velocity field, which require an accurate, quantitative
estimate of the 2-D motion field, edge-based algorithms
seem more suitable than algorithms based on spatial
and temporal derivatives of the image brightness. As
a consequence, in order to obtain a precise reconstruction
of the 2-D motion field, algorithms based on the
solution of the correspondence problem among suitably
defined features (as in stereo) may be used. This argument
agrees with the fact that, as intuitively expected,
the minimal optical flow and the motion field at image
features corresponding to precise locations on the
3-D surfaces coincide. It must be pointed out that in
this analysis we have not considered shadows and self-shadow
effects. They usually give rise to edges in the
image that do not correspond to features in the scene.
Furthermore, the Phong model of reflectance does not
include sharp intensity changes due to specularities.


4.  QUALITATIVE  PROPERTIES:  NOT 
JUST  DISCONTINUITIES 


Traditionally,  the  optical  flow  has  been  considered  as  the 
first  step  for  recovering  3-D  structure  and  3-D  motion. 
In  this  chapter  we  suggest  a  different  use  of  the  mini¬ 
mal  optical  flow.  We  argue  that  qualitative  properties 
of  the  2-D  motion  field  give  useful  information  about 
the  3-D  velocity  and  the  3-D  structure  of  surfaces  and 
that  these  qualitative  properties  can  be  usefully  inferred 
from  the  obtainable  minimal  optical  flow.  Discontinu¬ 
ities  in  the  optical  flow  are  the  typical  example  of  the 
stable  qualitative  properties  the  optical  flow  can  deliver. 
Motion  discontinuities  can  be  computed  easily  in  many 
cases (Reichardt et al., 1983; see also Little et al., these
Proceedings).  They  are  also  important  for  the  integra¬ 
tion  of  visual  information  (Poggio,  1985b).  They  are 
stable  in  the  sense  that  they  are  likely  to  be  invariant 
to  quantitative  changes  in  the  way  the  optical  flow  is 
computed. 

In  this  chapter  we  discuss  another  example  of  sta¬ 
ble,  qualitative  properties  of  the  optical  flow.  We  in¬ 
troduce  the  qualitative  properties  associated  with  2-D 
dynamical  systems  and  show  how  to  process  minimal 
optical  flow  and  motion  field  for  making  them  equiva¬ 
lent  to  flows  of  dynamical  systems  on  the  plane.  We 
then  suggest,  from  properties  of  structural  stability  of 
dynamical  systems,  that  the  minimal  optical  flow  may 
be  equivalent  to  the  motion  field  in  terms  of  qualitative 
properties. 


4.1.  Smoothing  the  Optical  Flow  and  the 
Motion  Field 

In order to establish a connection with the theory of
stability of dynamical systems, we must ensure that the
optical flow and the motion field have an appropriate
degree of smoothness. This is not always the case,
because of discontinuities arising at object boundaries or
because of noise. We suggest using a filtering step to
smooth the field. It is worth noting that a filtering
step on the normal component of a dense motion
field is a (regularization) method to recover the whole
2-D motion field. 3
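As a minimal sketch of such a filtering step, the Python fragment below smooths each component of a sampled flow field with a separable [1, 2, 1]/4 kernel. The kernel, the border handling (replication), and the names `smooth_component` and `smooth_flow` are assumptions made for illustration; the text does not specify which filter should be used.

```python
def smooth_component(grid, passes=1):
    """Smooth a 2-D grid of floats with a separable [1, 2, 1]/4 kernel.

    Border samples are replicated. The kernel is an assumed stand-in for
    whatever low-pass filter is actually chosen.
    """
    rows, cols = len(grid), len(grid[0])
    for _ in range(passes):
        # Horizontal pass.
        horiz = [[(row[max(c - 1, 0)] + 2.0 * row[c] + row[min(c + 1, cols - 1)]) / 4.0
                  for c in range(cols)]
                 for row in grid]
        # Vertical pass on the horizontally smoothed grid.
        grid = [[(horiz[max(r - 1, 0)][c] + 2.0 * horiz[r][c]
                  + horiz[min(r + 1, rows - 1)][c]) / 4.0
                 for c in range(cols)]
                for r in range(rows)]
    return grid


def smooth_flow(u, v, passes=1):
    """Smooth both components (u, v) of a flow field sampled on a grid."""
    return smooth_component(u, passes), smooth_component(v, passes)


# Hypothetical flow with a step discontinuity (e.g. an object boundary).
u = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
v = [[0.0] * 6 for _ in range(4)]
us, vs = smooth_flow(u, v)
```

After one pass the step in `u` becomes a ramp, while the constant component `v` is left unchanged; more passes give a smoother field at the cost of blurring the discontinuity further.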


4.2.  Qualitative  Descriptions  of  Dynamical 
Systems 


For  a  rigorous  and  thorough  review  on  dynamical  sys¬ 
tems  see  Hirsch  and  Smale  (1974).  Here,  for  the  sake  of 
completeness,  we  summarize  the  main  definitions  and 
results. 


A dynamical system is a $C^1$ map $\phi: R \times A \to A$,
where $A$ is an open set of a Euclidean space and, writing
$\phi(t, x) = \phi_t(x)$, the map $\phi_t: A \to A$ satisfies:


(a) $\phi_0: A \to A$ is the identity;

(b) the composition $\phi_t(\phi_s(x)) = \phi_{t+s}(x)$ for each $t, s \in R$.


A dynamical system $\phi_t$ on $A$ gives rise to a differential
equation on $A$, that is, a vector field $y: A \to E$
defined as follows:


$$y(x) = \frac{d}{dt}\,\phi_t(x)\Big|_{t=0} \qquad (4.3.1)$$


Thus, for every $x$, $y(x)$ is the tangent vector to the curve
$t \to \phi_t(x)$ at $t = 0$. In the following, we will restrict our
attention to planar systems (i.e. in what follows, $A$ will
be an open set in $R^2$). Constant solutions $x(t) = x^0$, which
occur where $y(x^0) = 0$, are called equilibrium points or
equilibria. The restriction to planar systems reduces the
classification of equilibria to four fundamental cases:


I: The saddle. The equilibrium is unstable (an
equilibrium is stable if every solution that starts near it
stays near it for all future time; it is unstable
otherwise).


3 This "smoothed" 2-D motion field may not be the same
as the one recovered using standard algorithms, but its
qualitative properties are likely to be preserved. The analogy
we are about to present, indeed, will support this argument
(and the equivalence between qualitative properties of the
2-D motion field and the optical flow as well).




II: The sink, which is a stable equilibrium. The main
property of a sink is that

$$\lim_{t \to \infty} x(t) = 0.$$

III: The source. The main property of a source is that

$$\lim_{t \to \infty} |x(t)| = \infty$$

and

$$\lim_{t \to -\infty} |x(t)| = 0.$$

A source can be considered as the dual case of
a sink: the phase portraits of a source and of the
corresponding sink are the same except for the
direction of the motion, which must be reversed.

IV: The center. All the solutions are periodic with the
same period. A center is a stable equilibrium. A
center, however, is not a structurally stable
property.
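For a linear planar system the four cases can be told apart from the trace and determinant of the coefficient matrix. The sketch below is an illustration under that standard assumption; the function name `classify_equilibrium` and the tolerance are hypothetical choices and are not taken from the paper.

```python
def classify_equilibrium(a, b, c, d, tol=1e-12):
    """Classify the origin of the linear planar system
        x' = a*x + b*y
        y' = c*x + d*y
    from the trace and determinant of its matrix.

    The eigenvalues solve  lambda**2 - trace*lambda + det = 0, so:
      det < 0              -> real eigenvalues of opposite sign (saddle)
      det > 0, trace < 0   -> both real parts negative (sink)
      det > 0, trace > 0   -> both real parts positive (source)
      det > 0, trace == 0  -> purely imaginary eigenvalues (center)
    """
    trace = a + d
    det = a * d - b * c
    if abs(det) <= tol:
        return "degenerate"  # non-isolated equilibrium, outside the four cases
    if det < 0:
        return "saddle"
    if abs(trace) <= tol:
        return "center"
    return "sink" if trace < 0 else "source"


examples = {
    "saddle": classify_equilibrium(1.0, 0.0, 0.0, -1.0),
    "sink": classify_equilibrium(-1.0, 0.0, 0.0, -2.0),
    "source": classify_equilibrium(1.0, 0.0, 0.0, 1.0),
    "center": classify_equilibrium(0.0, -1.0, 1.0, 0.0),
}
```

In this sketch spiral sinks and spiral sources are grouped with ordinary sinks and sources, in keeping with the four-way classification used in the text.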

The crucial point is that this classification is
exhaustive. Every solution to Equation (4.3.1) (in the
linear case) looks like a saddle, a sink, a source, or
a center. The same classification holds for the
non-linear case. However, non-linear systems are interesting
in themselves, since they can show a different qualitative
behavior. A non-linear system can have, in addition,
limit cycles. Intuitively, a limit cycle is a closed orbit
towards which other solution curves spiral with the
same asymptotic period. Defining an $\omega$-limit set $L_\omega(x)$
as $L_\omega(x) = \{a \in A \text{ such that } \exists\, t_n \to \infty \text{ with } x(t_n) \to a\}$
and similarly an $\alpha$-limit set $L_\alpha(x)$ as
$L_\alpha(x) = \{b \in A \text{ such that } \exists\, t_n \to -\infty \text{ with } x(t_n) \to b\}$,
a limit cycle is a closed orbit $\gamma$ such that
$\gamma \subset L_\omega(x)$ or $\gamma \subset L_\alpha(x)$
for some $x \notin \gamma$. Under somewhat more restrictive
conditions, a limit cycle can be a periodic attractor (for a
rigorous definition, see Hirsch and Smale 1974).
Intuitively, a periodic attractor is a limit cycle such that
nearby trajectories not only have the same asymptotic
period but also are in phase.

Saddles, sinks, sources and periodic attractors are
very important for a qualitative description of planar
systems. It can be shown that such properties are
structurally stable, that is, they persist after a perturbation of
the right-hand side of (4.3.2). As far as planar systems
are concerned, they also fully characterize limit sets.
By means of the Poincaré-Bendixson theorem it can be
shown that compact limit sets other than limit cycles
are saddles, or sinks, or sources, or trajectories joining
them.

4.3. Equilibria and their Interpretations

In the definition of dynamical system, the left-hand side
of Equation (4.3.1) can be interpreted as a vector field
tangent to the family of curves in the plane that are solutions
to (4.3.1) itself. It is straightforward to see that both
the smoothed optical flow and motion field (i.e. after
the filtering operation) can be considered as instances of
such a vector field 4. Indeed, it is sufficient to ensure
that both the fields are continuous with continuous first
derivatives. The classification of the solutions can now
be interpreted in terms of characteristic points of the
2-D motion field. A source, for example, corresponds to a
focus of expansion of the field. The structural stability
of the source, in turn, says that a focus of expansion
persists even if the field is perturbed. From this perspective,
a focus of expansion is expected to be detectable in a
2-D motion field reconstructed with different algorithms
and in the optical flow as well, when they can be
considered as perturbed examples of the "true" 2-D motion
field.
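The persistence of a focus of expansion under perturbation can be illustrated numerically for a linear field. In the sketch below, a pure expansion field (a source) and a pure rotation field (a center) are perturbed by adding up to `eps` to each matrix entry, and the resulting equilibria are classified from trace and determinant. The matrices, the perturbation size and the helper names are assumptions chosen for illustration only.

```python
from itertools import product


def classify(a, b, c, d, tol=1e-12):
    """Classify the origin of x' = a*x + b*y, y' = c*x + d*y."""
    trace, det = a + d, a * d - b * c
    if abs(det) <= tol:
        return "degenerate"
    if det < 0:
        return "saddle"
    if abs(trace) <= tol:
        return "center"
    return "sink" if trace < 0 else "source"


def labels_under_perturbation(matrix, eps):
    """Collect the labels obtained when every entry of the 2x2 matrix
    is independently shifted by -eps, 0 or +eps (81 combinations)."""
    (a, b), (c, d) = matrix
    found = set()
    for da, db, dc, dd in product((-eps, 0.0, eps), repeat=4):
        found.add(classify(a + da, b + db, c + dc, d + dd))
    return found


# Pure expansion: a focus of expansion at the origin.
source_labels = labels_under_perturbation(((1.0, 0.0), (0.0, 1.0)), 0.1)
# Pure rotation: a center at the origin.
center_labels = labels_under_perturbation(((0.0, -1.0), (1.0, 0.0)), 0.1)
```

Every perturbed version of the expansion field is still a source, whereas the rotation field turns into a sink or a source as soon as the perturbation changes the trace, which is the sense in which a center is not structurally stable.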

4.4.  Discussion 

If our point of view is correct, the only critical property
of the optical flow is that it has the same qualitative
properties as the 2-D velocity field. Quantitative
equivalence, which is impossible in general, is in any case
irrelevant for this use of the optical flow. As a consequence,
many different "optical flows" may be defined. Equation
(2.1.6) does not have any privileged role: other
definitions could be preferred on the basis of criteria such as
computability (from image data) or ease of
implementation (for given hardware constraints).

This  point  of  view  has  clear  implications  for  bio¬ 
logical  visual  systems:  movement  detecting  cells  (say,  in 
the  retina)  do  not  have  to  compute  the  specific  minimal 
optical  flow  defined  by  equation  (2.1.6):  other,  possibly 
simpler,  estimates  of  the  velocity  field  that  preserve  its 
qualitative  properties  are  equally  good  candidates  (such 
as  correlation-like  algorithms).  This  argument  may  ex¬ 
plain  why  the  models  proposed  to  explain  motion  depen¬ 
dent  behaviour  in  insects  (Hassenstein  and  Reichardt, 
1956),  motion  perception  in  humans  (Van  Santen  and 


4  We  stress  the  fact  that  the  analogy  with  the  dynamical 
system  is  between  phase  portraits  of  dynamical  systems 
and  motion  flows.  The  parameter  t  in  the  definition  of 
dynamical  system  is  not  the  physical  time.  We  consid¬ 
ered  motion  flows,  such  as  the  2-D  motion  field  or  the  op¬ 
tical  flow  at  a  fixed  time,  comparing  them  with  the  vector 
field  tangent  to  the  phase  portrait  of  some  system:  we  are 
not  interested  in  the  physical  meaning  of  the  underlying 
dynamical  system. 




Sperling, 1984) and physiology of cells (Barlow and
Levick, 1965; Torre and Poggio, 1978) all implement
computations quite different from the minimal optical
flow as it is usually defined (see equation 2.1.6). In
addition, these models do not typically measure velocity
- not even in the case of uniform translation in a
frontoparallel plane. Even for simple motions of the latter
type  the  output  of  models  such  as  the  correlation  models 
depends  on  both  the  velocity  and  the  spatial  structure 
of  the  moving  pattern.  One  is  tempted  to  consider  this 
as  a  weakness  of  these  models  compared  to  the  defini¬ 
tion  of  minimal  optical  flow,  Equation  2.1.7.  Our  re¬ 
sults,  however,  show  that  this  is  not  the  case:  first,  the 
minimal  optical  flow  is  correct  only  in  a  very  special 
situation;  second,  all  these  models  may  have  the  same 
qualitative properties as the motion field, which, from
our  point  of  view,  is  the  only  critical  requirement  for 
a  “good”  measurement  of  motion.  The  next  question 
is  of  course  whether  these  biological  models  are  in  fact 
“close”  enough  to  the  motion  field  to  share  the  same 
qualitative  properties.  We  do  not  know  the  answer  yet. 
We  conjecture,  however,  that  they  are  indeed  usually 
similar  enough  to  preserve  the  main  qualitative  prop¬ 
erties  of  the  motion  field.  The  conjecture  is  based  on 
results  (Poggio  and  Reichardt,  1973  and  Poggio,  1985b) 
showing  that  most  of  the  biological  models  proposed  so 
far  can  be  considered  as  special  instances  or  approxima¬ 
tions  of  a  general  class  of  nonlinear  models  (character¬ 
ized  as  Volterra  systems  of  the  second  order);  and  that 
the  minimal  optical  flow,  as  defined  in  equation  ,  is  also 
approximately  a  Volterra  functional  of  the  second  order 
(Poggio,  1985a). 

It  is  important  to  stress  that  the  approach  outlined 
in  the  second  part  of  this  paper  for  classifying  the  qual¬ 
itative  properties  of  the  optical  flow  is  only  one  of  the 
possible  methods.  While  we  plan  to  develop  further  that 
particular  approach,  others  should  be  explored  as  well: 
in  particular  flows  that  do  not  correspond  to  dynamical 
systems  on  the  plane  may  be  better  suited  for  captur¬ 
ing  important  and  stable  properties  of  the  velocity  field 
such  as  motion  discontinuities  (think  of  hydrodynam¬ 
ics).  In  this  case,  the  classification  of  qualitative  prop¬ 
erties  should  take  place  without  a  preliminary  smooth¬ 
ing  operation.  This  approach  would  have  the  advantage 
of  preserving  the  discontinuities  of  the  field  that,  as  we 
argued,  represent  very  useful  information. 

In  addition  to  the  classification  of  stable  qualita¬ 
tive  properties  of  the  velocity  field,  much  work  needs 
to  be  done  at  the  level  of  their  interpretation  in  terms 
of  3-D  structure  and  3-D  velocity.  Some  of  the  quali¬ 
tative  properties  of  the  (smoothed)  velocity  field  have 
an easy interpretation in those terms: an obvious
example is again a focus of expansion that is related to
“crashing” motion. It is likely that many, more subtle
relations  exist  between  the  qualitative  properties  of  the 
flow  and  the  underlying  3-D  motion  and  structure.  For 
example,  preliminary  results  by  Torre  et  al.  (personal 
communication)  suggest  that  the  number  of  focuses  in 
the  (smoothed)  field  may  be  characteristic  for  the  rigid¬ 
ity  of  motion  in  the  visible  scene. 

Finally,  we  should  mention  an  obvious  extension 
of  the  approach  described  in  the  second  part  of  the 
paper.  We  have  only  considered  so  far  the  velocity 
field  “frozen”  at  a  given  instant  of  time.  The  succes¬ 
sion  of  image  frames  provides  in  fact  a  time-dependent 
field:  the  evolution  in  time  of  the  qualitative  properties 
we  have  described  -  how  they  are  created,  disappear 
and  transform  -  should  be  characterized  in  qualitative 
terms, for instance using the language of catastrophe
and  bifurcation  theory.  The  use  of  time-dependent  fields 
should  be  practically  much  more  robust,  because  of  the 
redundant  information  available  in  a  sequence  of  very 
closely  spaced  frames  (in  time).  Our  analysis  should  be 
extended  to  qualitative  properties  that  are  structurally 
stable  not  only  at  a  given  time  but  also  in  the  time 
dependent  field. 


ACKNOWLEDGEMENTS 

We would like to thank Vincent Torre, Davi Geiger,
B. Caprile, Berthold K.P. Horn, and especially Bror
Saxberg for useful discussions.

REFERENCES 

Barlow, H. B. and Levick, R. W. 1965. The mechanism
of directional selectivity in the rabbit's retina. J.
Physiol. 173:477-504.

Born, M. and Wolf, E. 1959. Principles of Optics. New
York: Pergamon Press.

Gibson, J. J. 1950. The Perception of the Visual World.
Boston: Houghton Mifflin.

Fennema, C. L. and Thompson, W. B. 1979. Velocity
determination in scenes containing several moving
objects. Comput. Graph. Image Proc. 9:301-315.

Hassenstein, B. and Reichardt, W. 1956.
Systemtheoretische Analyse der Zeit-, Reihenfolgen-
und Vorzeichenauswertung bei der
Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z.
Naturforsch. 11b:513-524.

Hildreth, E. C. 1984a. The Measurement of Visual
Motion. Cambridge: MIT Press.

Hildreth, E. C. 1984b. The computation of the velocity
field. Proc. R. Soc. London B 221:189-220.

Hirsch, M. W. and Smale, S. 1974. Differential Equations,
Dynamical Systems, and Linear Algebra. New
York: Academic Press.

Horn, B. K. P. 1986. Robot Vision. Cambridge and New York:
Massachusetts Institute of Technology Press and
McGraw-Hill.

Horn, B. K. P. and Schunck, B. G. 1981. Determining
optical flow. Artif. Intell. 17:185-203.

Horn, B. K. P. and Sjoberg, R. W. 1979. Calculating the
reflectance map. Applied Optics 18:1770-1779.

Kanatani, K. 1985. Structure from motion without
correspondence: general principle. Proc. Image
Understanding Workshop, Miami, FL, pp. 107-116.

Marr, D. and Ullman, S. 1981. Directional selectivity and
its use in early visual processing. Proc. R. Soc.
London Ser. B 211:151-180.

Nagel, H.-H. 1984. Recent advances in image sequence
analysis. Proc. Premier Colloque Image - Traitement,
Synthèse, Technologie et Applications,
Biarritz, France, May, pp. 545-558.

Phong, B. T. 1975. Illumination for computer generated
pictures. Communications of the ACM 18:311-317.

Poggio, T. 1985a. Cold Spring Harbor lecture,
unpublished.

Poggio, T. 1985b. "Integrating Vision Modules with
Coupled MRF's," Massachusetts Institute of
Technology Artificial Intelligence Laboratory Working
Paper 285.

Poggio, T. and Reichardt, W. 1973. Considerations on
models of movement detection. Kybernetik 13:223-227.

Reichardt, W., Poggio, T. and Hausen, K. 1983.
Figure-ground discrimination by relative movement in the
visual system of the fly. Part II: Towards the neural
circuitry. Biol. Cybern. 46:1-30.

Schunck, B. G. and Horn, B. K. P. 1981. Constraints on
optical flow computation. Proc. IEEE Conf. Patt.
Recog. Image Proc., August, pp. 205-210.

Torre, V. and Poggio, T. 1978. A synaptic mechanism
possibly underlying directional selectivity to
motion. Proc. R. Soc. Lond. B 202:409-416.

Van Santen, J. P. H. and Sperling, G. 1984. A temporal
covariance model of motion perception. J. Opt. Soc.
Am. A 1:451-473.


AUTHORS

Tomaso Poggio: Artificial Intelligence Laboratory,
NE43-787; Massachusetts Institute of Technology, 545
Technology Square; Cambridge, MA 02139

Alessandro Verri: Università di Genova; Dipartimento di Fisica;
Via Dodecaneso 33; 16146 Genova, Italy




Algorithm  Synthesis  for  IU  Applications 

Michael  R.  Lowry 

AI Lab, Stanford University, Stanford, California 94305


Abstract 

Computer  vision  algorithms  incorporate  substantial 
domain  knowledge  and  programming  knowledge.  This 
paper  describes  the  capabilities  needed  for  automatic 
programming  in  the  domain  of  Image  Understanding 
(IU)  algorithms.  A  general  model  of  algorithm  deriva¬ 
tion  is  given  based  upon  an  analysis  of  papers  in  the 
December  1985  IU  workshop  proceedings.  Parts  of  this 
general  model  are  being  implemented  in  the  STRATA 
automatic  programming  system.  The  derivation  of  a 
non-trivial geometric optimization algorithm, the
Karmarker linear optimization algorithm, is given to
illustrate the methods used by STRATA.

1  Introduction 

Computer vision presents unique opportunities and
challenges to automatic programming research.
Computer vision algorithms incorporate substantial domain
knowledge from areas such as geometry, photometry,
and probability. Computer vision is computationally
intensive, yet even research prototypes must have
reasonable running times in order to adequately test ideas
against real images. Thus computer vision research
draws upon sophisticated knowledge in numerical
analysis, analysis of algorithms, and computational
geometry. Finally, computer vision is a rapidly developing
field with new concepts, theories, methods, and
algorithms reported every year.

This paper describes research in progress for partially
automating computer vision algorithm development.
In the near term, the methods which have been
developed could be used to implement an intelligent
programming assistant for computer vision programming.
The rest of this introduction briefly describes some
previous automatic programming research in knowledge
intensive domains. It also overviews the methods of
parameterized theory instantiation and invariant logic.
The second section presents a general model of Image
Understanding algorithm derivation with a brief
analysis of selected papers from the last IU workshop. The
third section is a conceptual overview of the Karmarker
linear programming algorithm. This algorithm
resembles many of the non-linear optimization algorithms
that arise in computer vision. The Karmarker
algorithm is also a good illustration of the automatic
programming methods I have developed. The last section
is a mechanizable derivation of a Karmarker algorithm
variant based upon parameterized theory instantiation
and invariant logic.

Most research in automatic programming has been
in knowledge-poor domains, where the primary
complexity is in algorithm derivation. An exception is the
work of Barstow et al. [Bar86] on synthesizing
programs for the domain of oil exploration at
Schlumberger. Barstow finds that the complexity of oil
exploration software arises from the wealth of the oil
exploration problem domain, not the algorithms
employed. In contrast, computer vision software is
complex due to both the wealth of the problem domain
and the sophistication and variety of the algorithms
which are employed. The major challenge of computer
vision programming is the interaction of sophisticated
domain knowledge and sophisticated algorithm
knowledge. The same conclusion has been reached by Kant
[KN83], who has examined the domain of computational
geometry.

The idea of formalizing high level algorithm schemas,
such as iterative convergence and hill climbing, is well
known, although their use in automatic programming
is fairly recent [Smi85] [Ric86]. In my research,
the applicability conditions for an algorithm schema
are represented declaratively as a parameterized theory
[GB85]. Actually, the relationship between the abstract
input/output conditions, the problem structure used by
an algorithm, and the algorithm schema itself is fairly
complex and has usually been given a schema specific
procedural encoding. A general declarative
representation is necessary for cleanly specifying the
incorporation of many different schemas into one particular
algorithm, and also for facilitating the use of different
search techniques in instantiating an algorithm schema.
The details of the declarative representation for
algorithm schemas can be found in [Low87]; this paper
illustrates their use in algorithm synthesis through
parameterized theory instantiation. The mathematical
theory of parameterized theories has been developed by
theoreticians applying mathematical logic to abstract
data types [GTW78].

Another  area  of  intensive  research  in  automatic  pro¬ 
gramming  has  been  the  automatic  selection  of  efficient 
data  types  for  set  theoretic  constructions.  This  is  the 
concern  of  the  SETL  project  [SSS86].  Yet  most  scien¬ 
tific  programming  that  is  related  to  real  world  physics 
relies  upon  the  efficient  implementation  of  geometric 
constructions  and  operations.  This  is  certainly  true  of 
IU  (image  understanding)  and  robotics. 

My  research  on  automatic  programming  develops  a 
framework  for  choosing  efficient  data  types  for  geomet¬ 
ric  constructions.  This  framework  is  based  upon  the 
modern  approach  to  Geometry  first  proposed  by  Felix 
Klein: 

A  geometry  is  the  study  of  the  properties  of 
a  set  S  which  remain  invariant  when  the  ele¬ 
ments  of  a  set  S  are  subjected  to  the  transfor¬ 
mations  of  some  transformation  group. 

The role of invariants and transformation groups is
central to modern geometry and physics; in fact,
Einstein at one time called his theory Invariant Theory.

I  apply  this  idea  to  automatic  programming  by  find¬ 
ing  the  invariants  of  a  problem  for  which  an  algorithm 
is  to  be  synthesized.  The  invariants  of  a  problem  indi¬ 
cate  in  what  abstract  theory  a  problem  should  be  for¬ 
mulated,  and  what  transformations  can  be  applied  in 
choosing  an  efficient  formulation  of  the  problem.  The 
idea of reformulating a problem using a transformation
which leaves the input/output behavior invariant is the
source of power for the Karmarker linear optimization
algorithm, whose synthesis will be described in this paper.

2  Model  of  Algorithm  Derivation  for 
IU  applications 

The following model of algorithm derivation for
image understanding is based upon my own experience
in computer vision [LM81] [Fea82] [LM82] [Low82]
[WBKL83] and my association with Stanford's vision
group. The model will be illustrated with a sampling
of those papers in the 1985 IU proceedings whose
title indicated an emphasis on algorithms 1: [PRH85],
[Lee85], [CK85]. Parts of this model are being
mechanized in my research on automatic programming, while
other parts will be studied in future research. Total
success in mechanizing this model would automate IU
research! Most likely, this research will lead to the
development of an intelligent programming assistant for
rapid prototyping of IU algorithms and for verifying
the correctness of IU algorithms.

1 Thanks to Professor Binford for this suggestion.

The  general  model  of  IU  Algorithm  derivation: 

1. Formulate an ill-structured problem as a formal
mathematical problem. Examples of ill-structured
problems are recovering depth from shading,
deriving motion parameters from successive camera
frames, or interpolating depth information from
sparse data. The formal problem is often
formulated as finding a solution to a set of equations, or
an optimal solution to a set of equations.

2. Instantiate several known algorithm schemas to
derive a skeleton algorithm which solves the
formal problem. For example, hill climbing, iterative
convergence, and coarse-to-fine are all algorithm
schemas that were used in the papers mentioned
above.

3. Instantiate the skeleton algorithm down to the
level of language independent subroutines, such as
Gaussian elimination or linear least squares. These
subroutines would reasonably be expected to be in
a scientific library package.

The  following  papers  from  last  year's  IU  proceedings 
illustrate  this  model  of  algorithm  derivation: 

Pavlin  [PRH85]  analyzed  an  algorithm  for  detecting 
translational  motion.  The  ill-structured  problem  is  to 
determine  vehicle  motion  using  vision.  This  problem 
is formalized as an optimization problem, namely
minimizing the global error for the parameters of
translational motion given correlation of features in successive
camera frames. The top level of the algorithm uses a
coarse  to  fine  search  to  determine  the  parameters  of 
translational  motion.  Within  each  level  of  coarseness, 
a  hill  climbing  algorithm  is  used  to  find  the  parameter 
values  with  minimal  error.  The  knowledge  needed  for 
this  algorithm  includes  probability,  geometry,  and  the 
search  schemas  of  coarse-to-fine  and  hill  climbing. 
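The way the two search schemas nest can be sketched in a few lines of Python. This is an illustration of the schema composition only, not the algorithm of [PRH85]: the error function `toy_error`, its hidden "true" translation, the step sizes and the function names are all hypothetical.

```python
def hill_climb(error, start, step):
    """Hill climbing schema: move to the best axis-aligned neighbour at
    spacing `step` until no neighbour has a strictly lower error."""
    current = start
    while True:
        x, y = current
        neighbours = [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]
        best = min(neighbours, key=error)
        if error(best) >= error(current):
            return current
        current = best


def coarse_to_fine(error, start, coarsest=4.0, levels=6):
    """Coarse-to-fine schema: run the inner search at each resolution,
    halving the step and reusing the previous estimate as the start."""
    estimate, step = start, coarsest
    for _ in range(levels):
        estimate = hill_climb(error, estimate, step)
        step /= 2.0
    return estimate


def toy_error(p):
    """Hypothetical global error: squared distance of the candidate
    translation parameters from an assumed true value (3.25, -1.5)."""
    return (p[0] - 3.25) ** 2 + (p[1] + 1.5) ** 2


estimate = coarse_to_fine(toy_error, (0.0, 0.0))
```

The outer loop never needs to know how the inner search works, which is what makes it natural to treat each schema as a separately instantiated component of the skeleton algorithm.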

Lee [Lee85] presents an iterative algorithm for shape
from  shading.  The  ill-structured  problem  is  to  deter¬ 
mine  shape  from  shading  and  occluding  boundaries. 
This  problem  is  formalized  as  an  optimization  problem 
over  a  set  of  non-linear  equations.  The  paper  shows 
existence  and  uniqueness  of  a  minimal  solution  of  the 
non-linear  equations,  and  proves  that  a  new  iterative 
algorithm  based  on  spline-smoothing  converges. 

Choi [CK85] presents a parallel algorithm for solving
the depth interpolation problem. The authors state in their
abstract: “Many low level computer vision problems,
including depth interpolation, can be cast as solving a
symmetric positive definite matrix.” First, the depth
interpolation problem is formalized as solving a set of
linear equations. Because of the special form of the
matrix, various iterative methods with fast convergence
rates are applicable. The major part of the paper
describes the parallelization of an adaptive Chebyshev
acceleration method, using a parallel tree architecture.
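To make the phrase "iterative methods for a symmetric positive definite system" concrete, the sketch below uses the conjugate gradient method. This is a different, commonly used iterative method chosen for brevity; it is not the adaptive Chebyshev acceleration method that [CK85] parallelizes, and the 2x2 system is a made-up example rather than a depth interpolation problem.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite matrix A
    (given as a list of rows) with the conjugate gradient iteration."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction: residual plus a multiple of the old direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite (assumed example)
b = [1.0, 2.0]
x = conjugate_gradient(A, b)   # exact solution is (1/11, 7/11)
```

Each iteration needs only one matrix-vector product and a few inner products, which is the kind of structure that lends itself to parallel implementation.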

The  first  step  of  formalizing  an  ill-structured  prob¬ 
lem  involves  many  creative  aspects  based  on  experience 
and  judgement.  Simplifying  assumptions  are  invoked 
which  hopefully  do  not  exclude  the  real  world  prob¬ 
lems  which  motivated  the  research.  This  is  why  testing 
against  real  data  is  the  ultimate  verification  of  an  IU 
algorithm.  Methods  for  deriving  an  algorithm  given  a 
formal  mathematical  specification  are  described  in  sub¬ 
sequent  sections. 

3  Linear  Optimization 

Linear  optimization  is  finding  the  optimum  value  of  a 
linear  cost  function  given  a  set  of  linear  constraints.  As 
shown  in  the  previous  section,  mathematical  optimiza¬ 
tion  is  at  the  core  of  many  IU  algorithms.  Sometimes 
the  cost  function  or  the  set  of  constraints  is  non-linear, 
but the derivation of these algorithms has much in
common  with  the  linear  case.  The  derivation  of  a  vari¬ 
ant  of  the  Karmarker  algorithm  is  used  to  illustrate 
my  methods  of  automatic  programming  and  problem 
reformulation.  This  section  presents  an  overview  of 
linear  programming  algorithms  and  the  reformulation 
methods  used  to  synthesize  them  from  specification. 
The  next  section  presents  the  details  of  the  algorithm 
derivation. 

A linear cost function over $R^n$ is expressed as a
function of the following form:

$$a_1 x_1 + a_2 x_2 + \dots + a_n x_n$$

In linear optimization, the objective is to minimize
(or equivalently maximize) this function over a domain
which is defined by a set of linear equalities and
inequalities:

$$a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n = b_1$$
$$\vdots$$
$$a_{m1} x_1 + a_{m2} x_2 + \dots + a_{mn} x_n \le b_m$$

From the more abstract viewpoint of Euclidean
geometry, the linear constraints describe a convex
polyhedron (possibly unbounded or empty), and the cost function
describes a direction. The desired output is the point
on the polyhedron which is furthest along the direction
vector.

The insight of the simplex algorithm is that the output
will be a vertex of this polyhedron (or at least include
a vertex, in degenerate cases where an entire edge or
face is perpendicular to the direction vector). The
skeleton of the simplex algorithm is hill climbing
between adjacent vertices until a local optimum is
reached. Because of convexity, a local optimum is
guaranteed to be a global optimum.
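
The vertex property can be checked by brute force on a small two-dimensional problem. The sketch below is only an illustration, not the simplex algorithm: it assumes a bounded, non-empty polygon given as inequalities a1*x + a2*y <= b, and the name `best_vertex` is hypothetical.

```python
from itertools import combinations

def best_vertex(constraints, direction, tol=1e-9):
    """Return the feasible vertex furthest along `direction`.

    `constraints` is a list of (a1, a2, b) triples meaning a1*x + a2*y <= b.
    Every pair of boundary lines is intersected; the feasible intersections
    are the vertices, and the one with the largest dot product wins.
    """
    best, best_value = None, None
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:           # parallel boundaries never meet
            continue
        x = (b * c2 - a2 * d) / det  # Cramer's rule
        y = (a1 * d - b * c1) / det
        if all(p * x + q * y <= r + tol for p, q, r in constraints):
            value = direction[0] * x + direction[1] * y
            if best_value is None or value > best_value:
                best, best_value = (x, y), value
    return best, best_value

# x >= 0, y >= 0, x + y <= 4, x <= 3, each written as a . (x, y) <= b
polygon = [(-1, 0, 0), (0, -1, 0), (1, 1, 4), (1, 0, 3)]
vertex, value = best_vertex(polygon, direction=(1, 2))
```

For this polygon the vertices are (0, 0), (3, 0), (3, 1) and (0, 4), and the last one is furthest along the direction (1, 2). Enumerating all pairs grows quickly with the number of constraints, which is why the simplex algorithm walks between adjacent vertices instead.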

The insight of the Karmarker algorithm is that the
linear optimization problem is abstractly a problem of
oriented projective geometry. This means that only
incidence (e.g. point p is on line l) and sidedness (e.g.
points p and q are on opposite sides of line l) are
essential properties for solving linear optimization
problems. Distance metrics, angle metrics, even parallelism
are irrelevant. In other words, linear optimization is
invariant under projective transformations which
preserve orientation. The modified Karmarker algorithm
is somewhat less general, using the invariance of linear
optimization under affine transformations. The
difference is that the modified version preserves parallelism;
consequently, it is not a provably polynomial time
algorithm. However, both the algorithm and the derivation
are easier to explain, while preserving the flavor of the
Karmarker algorithm. On practical problems, the
Karmarker algorithm and the modified version have similar
performance.

The skeleton algorithm for the Karmarker algorithm
and related algorithms that travel through the interior of
the polyhedron, as opposed to along its boundary,
is an iterative approximation algorithm. Each
iteration begins with the polyhedron, the direction
vector, and an interior point approximating the optimal
point, and derives a better interior approximation to the
optimal point. Convergence is measured by the number of bits
of accuracy of the approximate point to the actual
optimal point, so the iteration resembles Newton's method.

The major constraint of the inner loop is that the
new approximation is an interior point. This constrains
how long a step can be taken. The solution is
to reformulate the approximation problem for each
inner loop to a canonical approximation problem using
affine or projective transformations. The inverse
transformation is used to map the new approximation in the
reformulated problem back to a new approximation in
the original problem space. Basically, the canonical
approximation problem assumes that the interior point is
at location $(1, 1, \dots, 1)^T$, and the problem is to find a
better point whose coordinates are non-negative, i.e.
which is in the positive orthant. The solution is
simple: take a unit step length in the optimal direction.
This basic idea is modified to define a polyhedron as
the intersection of the positive orthant and an affine
subspace. In this case the old approximation and the
new approximation are projected onto the affine
subspace. An affine subspace is defined by the following
equation:

$Ax = b$


[Figure: Linear Optimization Algorithms. A polyhedron with the direction vector, the starting point, and the optimal point, showing the simplex algorithm path and the Karmarker algorithm path.]

[Figure: Reformulation in Inner Loop of Modified Karmarker Algorithm to Canonical Problem.]


Thus  the  skeletal  algorithm  is  an  iterative  approx¬ 
imation  with  the  inner  loop  requiring  a  transforma¬ 
tion  to  a  canonical  approximation  problem,  solving  the 
canonical  approximation  problem  (using  a  linear  least 
squares  approximation  for  projecting  onto  the  affine 
subspace),  and  then  transforming  the  new  approxima¬ 
tion  back  onto  the  original  problem  space. 

4  Karmarker  Algorithm  Derivation 

The  Karmarker  polynomial  time  algorithm  for  linear 
programming  has  sparked  tremendous  interest  in  the 
operations  research  community  and  has  inspired  oth¬ 
ers  to  investigate  similar  methods.  The  first  theoreti¬ 
cally  polynomial  time  algorithm  for  linear  optimization 
[Kha79]  was  not  a  practical  alternative  to  the  simplex 
algorithm.  In  practice  the  simplex  algorithm  is  linear 
in  the  number  of  constraints,  though  in  the  worst  case 
can  be  exponential  in  the  number  of  variables.  The 
Karmarker  algorithm  and  some  variants  are  both  theo¬ 
retically  polynomial  and  are  competitive  with  the  sim¬ 
plex  algorithm  in  practice.  These  iterative  algorithms 
are  based  upon  repeated  projective  transformations  to 
a  canonical  problem  of  optimization  over  a  sphere  in¬ 
scribed  in  the  feasible  region.  It  is  this  reformulation 
to  a  canonical  problem  which  is  of  major  interest  with 
respect  to  my  reformulation  methods. 

In  this  section  a  simple  variant  based  upon  repeated 
affine  transformations  is  derived.  This  algorithm  has 
the  same  skeleton  as  the  original  Karmarker  algorithm, 
and  appears  to  have  substantial  practical  value.  In¬ 
terestingly,  this  algorithm  is  not  theoretically  polyno¬ 
mial  time.  This  is  because  affine  geometry  is  not  as 
abstract  as  projective  geometry  (affine  geometry  pre¬ 
serves  parallelism),  and  hence  projective  transforma¬ 
tions  are  more  powerful  and  can  lead  to  theoretically 
faster  convergence.  Despite  being  theoretically  less 
powerful,  this  simple  variant  illustrates  the  abstrac¬ 
tion/implementation  method  of  reformulation. 

The key to the abstraction/implementation method
of reformulation, as applied to geometrical problems,
is to find the transformations which leave the
problem invariant. This information can then be used in
two ways. In the first method, called canonicalization,
which is illustrated here, these transformations are
used to reformulate a problem to a canonical problem.
For example, in a combinatorial problem involving lists,
where the problem is invariant under reordering of the
lists, the lists could be sorted to make the
solution more efficient. In the second method, which is
developed in [Low87], the definition of the problem
is abstracted into terms which are invariant under the
transformations. For example, lists which can be
reordered would be abstracted to bags. This method is
more powerful than the first since any implementation
of bags can be chosen, not just canonicalized lists.

The formal specification of the problem from which
an algorithm will be derived is the standard form for
linear optimization²:

Minimize $cx$

such that $Ax = b$ and $x_1, \dots, x_n \ge 0$
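
Problems stated with inequalities can be brought into this standard form by adding slack variables. The sketch below shows that well-known conversion for constraints of the form A_ub x <= b_ub with x >= 0; the function name `to_standard_form` and its argument names are hypothetical.

```python
def to_standard_form(A_ub, b_ub, c):
    """Rewrite  minimize c.x   subject to  A_ub x <= b_ub, x >= 0
    as          minimize c'.x' subject to  A x' = b,       x' >= 0
    by appending one non-negative slack variable per inequality row."""
    m = len(A_ub)
    A = [list(row) + [1.0 if i == j else 0.0 for j in range(m)]
         for i, row in enumerate(A_ub)]
    c_std = list(c) + [0.0] * m  # slack variables do not affect the cost
    return A, list(b_ub), c_std

# x1 + x2 <= 4 and x1 <= 3 become two equalities with slacks s1 and s2.
A_std, b_std, c_std = to_standard_form([[1, 1], [1, 0]], [4, 3], [-1, -2])
```

Each slack variable measures how far a point is from the corresponding constraint boundary, so the slack is zero exactly on that boundary.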

The first step in the invariant method is to symbolically
solve for transformations which leave the problem
invariant. An alternative to carrying out symbolic
matrix algebra is Invariant Logic [Low87]. Invariant
Logic solves for problem invariants using the invariants
of known relations:

Invariant(Incidence, ProjectiveTransformations)
Invariant(Ordering, OrientedProjectiveTrans)
Invariant(NonNegative, PositiveScaling)³

Groups of invertible transformations (over an N-dimensional
vector space defined on a field F) form a
lattice under inclusion. The bottom of the lattice is
the singleton group consisting of the identity
transformation. The top of the lattice is the group of all
invertible transformations. A problem is guaranteed to be
invariant under the intersection of the invariants of the
relations which define the problem. This intersection
is the meet of the invariants of the composing relations
in the lattice, which can be thought of as their
greatest common divisor. This simple method
will not always find the most general group of
transformations which leave a problem invariant; for a more
powerful method see [Low87].
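
The intersection of invariants can be illustrated with a toy model. The sketch below assumes, as a simplification, that the groups of interest form a single chain under inclusion; the names `CHAIN`, `INVARIANT`, and `problem_invariant` are hypothetical and do not describe STRATA's actual representation.

```python
# Toy chain of transformation groups, smallest first. Each group is assumed
# to be contained in every group listed after it.
CHAIN = ["Identity", "PositiveScaling", "OrientedProjective",
         "Projective", "AllInvertible"]

def subgroups_of(group):
    """All groups in the toy chain contained in `group`, including itself."""
    return set(CHAIN[:CHAIN.index(group) + 1])

# Invariants of the known relations, as quoted in the text.
INVARIANT = {
    "Incidence": "Projective",
    "Ordering": "OrientedProjective",
    "NonNegative": "PositiveScaling",
}

def problem_invariant(relations):
    """Largest group in the chain under which every listed relation is invariant."""
    common = set(CHAIN)
    for relation in relations:
        common &= subgroups_of(INVARIANT[relation])
    return max(common, key=CHAIN.index)

lp_invariant = problem_invariant(["Incidence", "Ordering", "NonNegative"])
```

With all three relations present, the result is positive scaling. Dropping the non-negativity relation would leave the larger oriented projective group.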

The simplified method of Invariant Logic is to classify
the constituent parts of the problem definition by
their invariants and then to intersect these invariants.
For example, $Ax = b$ is an incidence relation and
hence invariant under all projective transformations.
Minimize $cx$ is an ordering relation and an incidence
relation, so it is invariant under oriented projective
transformations. After applying Invariant Logic to this
definition of linear optimization through simple forward
chaining, the following theorem is derived:

Invariant(LinearOptimization, PositiveScaling)⁴

The method of parameterized theory instantiation is
illustrated by instantiating the schema for iterative
improvement for linear optimization. This forms the
top-level skeleton of the Karmarker algorithm. The first
step is to instantiate the abstract input-output relation
to the actual input-output relation. This involves
generating a substitution for the parameters D, S, out, and
Cost-Relation such that the instantiated axioms are
provable. For iterative improvement, the axioms of
the abstract input-output relation are that the
cost-relation between feasible outputs is total and
transitive.

² Some authors call this the canonical form.

³ Although positive translations also leave non-negativity
invariant, their inverses, negative translations, do not leave
non-negativity invariant. Actually, non-negativity-preserving invertible
transformations are more general than positive scaling
transformations. For example, Karmarker used a projective
transformation that preserved non-negativity.

⁴ Only in the canonicalization method does non-negativity
need to be kept invariant. In the more general
abstraction/implementation method, the non-negativity constraint is
abstracted to the intersection of half-spaces.

Iterative Improvement Schema⁵

Abstract Input-Output Relation:

Domain: $D$
Cost-Relation: $D \times D$
Input: $S \subseteq D$
Output: $out \in S$
IORelation: $\forall y \in S \;\; \mathrm{CostRelation}(out, y)$

Axioms:

$\forall x, y \in D \;\; \mathrm{CostRelation}(x, y) \text{ OR } \mathrm{CostRelation}(y, x)$

$\forall x, y, z \in D \;\; \mathrm{CostRelation}(x, y) \text{ AND } \mathrm{CostRelation}(y, z) \Rightarrow \mathrm{CostRelation}(x, z)$

Additional Problem Structure:

Better-element: $D \to D$

Axioms for additional problem structure:

$\forall x \in D \;\; \mathrm{BetterElement}(x) \in D$

$\forall x \in D \;\; \mathrm{CostRelation}(x, \mathrm{BetterElement}(x))$

The  following  instantiation  of  the  abstract  input- 
output  relation  successfully  matches  the  formal  spec¬ 
ification  for  linear  optimization: 

$D \mapsto R^n$

$S \mapsto \{x \mid Ax = b \text{ AND } x_i \ge 0\}$

$\mathrm{CostRelation}(x, y) \mapsto cx \ge cy$

The  next  step  is  to  instantiate  the  additional  prob¬ 
lem  structure  which  forms  the  applicability  conditions 
of  iterative  improvement  for  optimization  problems. 
From  an  AI  viewpoint,  this  is  a  search  problem  which 
uses  constraint  propagation.  In  particular,  the  instan¬ 
tiation  of  the  abstract  input-output  relation  partially 
constrains  the  choice  of  the  better-element  function. 
An  additional  source  of  constraint  is  efficiency  criteria, 
which  are  discussed  in  [Low87]. 
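
The iterative improvement schema can be read as a higher-order procedure whose parameters are filled in per problem. The sketch below is a hypothetical rendering of that idea on a deliberately small integer problem, not the linear optimization instantiation; `iterative_improvement`, `neighbour_step`, and `cost` are illustrative names.

```python
def iterative_improvement(start, better_element, strictly_better, max_iter=10_000):
    """Repeatedly replace the current element by better_element(current)
    until the candidate is no longer a strict improvement."""
    current = start
    for _ in range(max_iter):
        candidate = better_element(current)
        if not strictly_better(candidate, current):
            return current
        current = candidate
    return current

# Hypothetical instantiation: D = integers, S = {0, ..., 20}, cost (x - 7)^2.
def cost(x):
    return (x - 7) ** 2

def neighbour_step(x):
    """Better-element function: the cheapest of x and its feasible neighbours."""
    feasible = [y for y in (x - 1, x, x + 1) if 0 <= y <= 20]
    return min(feasible, key=cost)

result = iterative_improvement(0, neighbour_step, lambda a, b: cost(a) < cost(b))
```

The two requirements on the better-element function are visible here: `neighbour_step` only returns members of S, and it never returns an element with worse cost than its input.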

For this example of instantiating iterative improvement,
the additional problem structure which is partially
constrained is the better-element function. The
constraints on this function are that its output be a
member of S, and that the output has better cost than
the input, as expressed in the two axioms. It is the first
constraint, that the output be a member of S, which
generates the canonicalization reformulation.

⁵ Newton's Method, among others, is a specialization of this schema.


Using the following domain knowledge about linear
least squares, the constraint that the output of
Better-element be in the affine subspace defined by $Ax = b$ is
factored out, leaving the constraint of non-negativity
(Linear Least Squares is abbreviated LLS, while Affine
Space is abbreviated AfS):

$\forall x, \mathrm{AfS} \;\; \mathrm{LLS}(x, \mathrm{AfS}) \in \mathrm{AfS}$

$\forall x, y, \mathrm{AfS} \;\; \mathrm{LLS}(x + y, \mathrm{AfS}) = \mathrm{LLS}(x, \mathrm{AfS}) + \mathrm{LLS}(y, \mathrm{AfS})$

$\forall x, y, c, \mathrm{AfS} \;\; cx \ge cy \Rightarrow c(\mathrm{LLS}(x, \mathrm{AfS})) \ge c(\mathrm{LLS}(y, \mathrm{AfS}))$

These  equations  are  elementary  consequences  of  lin¬ 
ear  least  squares  being  an  orientation-preserving  linear 
operator.  A  simple  consequence  of  these  equations  is 
the  following  theorem  which  is  used  in  turn  to  derive  a 
sufficient  condition  on  the  better-element  function. 

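$\forall x, y, c, \mathrm{AfS} \;\; cy \ge 0 \Rightarrow c(\mathrm{LLS}(x + y, \mathrm{AfS})) \ge c(\mathrm{LLS}(x, \mathrm{AfS}))$

$\forall x, y, c, \mathrm{AfS} \;\; x_i \ge 0 \text{ AND } cy \ge 0 \text{ AND } x + y \ge 0 \Rightarrow \mathrm{BetterElement}(\mathrm{LLS}(x, Ax = b)) = \mathrm{LLS}(x + y, Ax = b)$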

Thus constraint propagation yields sufficient conditions
on an increment vector y for iterative improvement.
These conditions are independent of the particular
affine subspace, because the linear least squares
operator incorporates the constraint of being in the
affine subspace. It is these sufficient conditions which
can be reformulated through positive-scaling
transformations to the canonical problem of optimization over
a unit sphere centered at the vector $(1, 1, \dots, 1)^T$. We
assume that the following domain knowledge is
available to STRATA:

Inscribed(Sphere((1, 1, ..., 1), 1), PositiveOrthant)

Optimize(direction, sphere) = center + radius * direction

Through additional instantiation, the vector x is
mapped to the center of the unit sphere and the
increment vector y is mapped to radius * direction. This
gives a full instantiation of the BetterElement function
for any direction vector and affine space if the current
interior point is the vector of all ones $(1, 1, \dots, 1)^T$. However,
Invariant Logic has shown that the linear optimization
problem is invariant under positive scaling! Thus, it is
always possible to reformulate to the canonical
problem where the current interior point is the vector of all ones.
This yields the algorithm discussed in section 3.
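
The positive-scaling reformulation can be made concrete with a change of variables x_k = d_k * y_k, where d is the current interior point. The sketch below is a minimal illustration under that assumption; `to_canonical` and `from_canonical` are hypothetical names.

```python
def to_canonical(A, c, x):
    """Rescale the variables by the current interior point x.

    Returns (A', c') for the equivalent problem in y, where x_k = x_k_current * y_k.
    The current point maps to y = (1, ..., 1), and y >= 0 exactly when x >= 0.
    """
    A_scaled = [[a * d for a, d in zip(row, x)] for row in A]
    c_scaled = [ck * d for ck, d in zip(c, x)]
    return A_scaled, c_scaled

def from_canonical(y, x):
    """Inverse transformation: map a point y of the scaled problem back."""
    return [yk * d for yk, d in zip(y, x)]

A = [[1.0, 2.0, 1.0]]
c = [3.0, 1.0, 0.0]
current = [0.5, 1.0, 1.5]  # an interior point with A . current = 4
A_can, c_can = to_canonical(A, c, current)
ones = [1.0, 1.0, 1.0]
```

Evaluating the scaled constraint row and scaled cost at the all-ones vector gives the same values as the original problem at the current point, which is the sense in which the two problems are equivalent.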

The hand simulated derivation given above captures
the major features of the Karmarker linear programming
algorithm. The methods of Invariant Logic
and Parameterized Theory Instantiation play a critical
role in incorporating domain knowledge and algorithm
knowledge into the derivation. This derivation ignores
many special cases, which would be tedious to present
in this paper. At present, my research work includes
mechanizing this derivation and extending the methods
to incorporate efficiency constraints.


5 Summary

This paper has discussed the capabilities needed for
deriving Image Understanding algorithms. Automatically
formulating an ill-structured problem as a formal
mathematical problem is beyond the horizon of current
AI research. Deriving an efficient algorithm for a formal
mathematical problem is the subject of this paper.

One major requirement is to represent high-level
programming knowledge in a clean modular representation.
This facilitates the interaction of domain knowledge
and programming knowledge. Parameterized theories,
such as the theory for iterative improvement, are
used in STRATA to represent high-level programming
knowledge. Instantiating the parameters for a particular
problem, such as linear optimization, results in a
top-level description of an algorithm. Since the parameters
are mutually constraining through the axioms of
the theory, instantiating the parameters is a constraint
satisfaction problem.

The second major requirement for deriving geometric
algorithms is to extend abstract data types to the
geometric domain. From the viewpoint of modern geometry,
finding an abstract description of a geometric
problem is equivalent to finding the group of transformations
which leave it invariant. This paper describes
Invariant Logic for determining the invariants
of a problem. The invariants can be used to reformulate
a problem to a known canonical problem.

The STRATA automatic programming system uses
parameterized theory instantiation and invariant logic
to derive algorithms for formal geometric problems.
These two methods are illustrated with the derivation
of a variant of the Karmarker linear programming algorithm.
These methods are generalizable to the derivation
of other algorithms in Image Understanding.

6 Acknowledgements

This paper benefited from discussions with Professor
Thomas Binford, Dr. Yinyu Ye (Stanford-Operations
Research), Dr. Joseph Goguen (SRI-Theory of Abstract
Data Types), Dr. George Stolfi (Stanford-Computational
Geometry), Dr. Douglas Smith and Dr.
Cordell Green (Kestrel Institute-knowledge based automatic
programming). This work was supported in
part by DARPA contract N00039-84-C-0211, and ONR
contract N00014-84-C-0473. The views and conclusions
contained in this paper are solely those of the author
and should not be interpreted as representing the official
policies, either expressed or implied, of the U.S.
Government.

References

[Bar86] David Barstow. An experiment in knowledge-based automatic programming. In Charles Rich and Richard C. Waters, editors, Artificial Intelligence and Software Engineering, Morgan Kaufmann Publishers, Inc., 1986.

[CK85] Dong J. Choi and John R. Kender. Solving the depth interpolation problem with the adaptive Chebyshev acceleration method on a parallel computer. In Proceedings: Image Understanding Workshop, pages 219-223, December 1985.

[Fea82] Martin M. Fischler et al. Modeling and using physical constraints in scene analysis. In AAAI-82, August 1982.

[GB85] Joseph A. Goguen and Richard M. Burstall. Institutions: Abstract Model Theory for Computer Science. Technical Report CSLI-85-30, CSLI, 1985.

[GTW78] Joseph Goguen, Jim Thatcher, and Eric Wagner. An initial algebra approach to the specification, correctness, and implementation of abstract data types. In Raymond T. Yeh, editor, Current Trends in Programming Methodology, pages 80-149, Prentice-Hall, 1978.

[Kha79] L. G. Khachiyan. A polynomial algorithm for linear programming. Soviet Math Doklady, 20:191-194, 1979.

[KN83] Elaine Kant and Allen Newell. An automatic algorithm designer: an initial implementation. In AAAI-83, 1983.

[Lee85] David Lee. A provably convergent algorithm for shape from shading. In Proceedings: Image Understanding Workshop, pages 489-495, December 1985.

[LM81] Michael R. Lowry and Allen Miller. A general purpose VLSI chip for computer vision with fault-tolerant hardware. In Image Understanding Workshop, April 1981.

[LM82] Michael R. Lowry and Allen Miller. Analysis of low-level computer vision algorithms for implementation on a VLSI processor array. In SPIE-82, August 1982.

[Low82] Michael R. Lowry. Reasoning between structure and function. In Image Understanding Workshop, September 1982.

[Low87] Michael R. Lowry. Algorithm Synthesis through Problem Reformulation. PhD thesis, Stanford University, 1987.

[PRH85] Igor Pavlin, Edward Riseman, and Allen Hanson. Analysis of an algorithm for detection of translational motion. In Proceedings: Image Understanding Workshop, pages 388-399, December 1985.

[Ric86] Charles Rich. A formal representation for plans in the programmer's apprentice. In Charles Rich and Richard C. Waters, editors, Artificial Intelligence and Software Engineering, Morgan Kaufmann Publishers, Inc., 1986.

[Smi85] Douglas R. Smith. Top-down synthesis of divide-and-conquer algorithms. Artificial Intelligence, 27(1), September 1985.

[SSS86] Edmond Schonberg, Jacob T. Schwartz, and Micha Sharir. An automatic technique for selection of data representations in SETL programs. In Charles Rich and Richard C. Waters, editors, Artificial Intelligence and Software Engineering, Morgan Kaufmann Publishers, Inc., 1986.

[WBKL83] Patrick Winston, Thomas Binford, Boris Katz, and Michael R. Lowry. Learning physical descriptions from functional definitions, examples, and precedents. In AAAI-83, August 1983.


GENERALIZING  EPIPOLAR-PLANE  IMAGE  ANALYSIS 
FOR  NON-ORTHOGONAL  AND  VARYING 
VIEW  DIRECTIONS  * 


H.  Harlyn  Baker 
Robert  C.  Bolles 
David  H.  Marimont 

Artificial  Intelligence  Center 
SRI  International 
333  Ravenswood  Avenue 
Menlo  Park,  CA  94025. 


Abstract 

Earlier  reports  have  described  the  theory  and  practice 
of  Epipolar-Plane  Image  Analysis.  Our  previous  imple¬ 
mentations  of  this  mapping  technique  demonstrated  the 
feasibility  and  benefits  of  the  approach,  but  were  carried 
out  for  restricted  camera  geometries.  The  question  of  more 
general  geometries  made  utility  for  autonomous  navigation 
uncertain. Here, we discuss a generalisation of the analysis
that a) enables varying view direction (including varying
over time), b) provides three-dimensional connectivity
information for building coherent spatial descriptions of
observed objects, and c) raises the possibility of removing
the restriction of travel along a linear trajectory.


1: Introduction

Approaches  to  depth  measurement  through  stereo  analy¬ 
sis  suffer  from  the  dichotomy  of  choosing  between  a  wide  base¬ 
line,  with  high  precision  of  matched  features  but  both  increased 
match  failures  and  increased  difficulties  of  perspective  and  oc¬ 
clusion  effects,  and  a  narrow  baseline,  with  easy  matching  but 
poor  depth  accuracy.  This  is  an  obvious  limitation  confronted 
whenever  depth  measurements  must  be  made  on  the  basis  of  just 
2  views  of  a  scene.  In  this  work  we  take  a  direction  that  gains 
the  advantages  of  both  approaches.  We  process  a  large  num¬ 
ber  of  closely  spaced  images:  the  close  spacing  makes  matching 
easy,  and  the  large  number  of  images  means  a  wider  baseline, 
and  therefore  higher  accuracy.  While  bearing  the  increased  cost 
of  processing  the  large  number  of  images,  and  having  to  know 
precisely  the  position  and  attitude  of  the  camera  at  each  imag¬ 
ing  site,  our  technique  brings  significant  advantages  in  accuracy 
and  reliability  over  existing  depth  measurement  approaches. 

Our approach (see [2] and [3] for details) is to take a
sequence of images from positions that are very close together,
close enough that almost nothing changes from one image to
the next. With this capture spacing, none of the image features
moves more than a few pixels. This sampling frequency guarantees
a continuity in the temporal domain similar to the obvious
spatial continuity, and lets us avoid the difficult correspondence
problem. An edge of an object in one image appears temporally
adjacent to (within a pixel or so of) its occurrence in both the
preceding and following images. This rapid sampling makes it
possible to construct a volume of data (spatio-temporal data) in
which time is the third dimension and continuity is maintained
over all three dimensions (Figure 1). In one of the traditional
motion-analysis paradigms, features are detected in successive
spatial images and matched from one to the next, with the difference
in position used to deduce motion (see, for example, Roach
[11]). In another, differential techniques are used between successive
images to measure the direction of intensity variation, and
thereby the direction of inferred motion (for example, Buxton [6]
and Waxman [12]). We, however, take an approach orthogonal
to these (literally). We slice the spatio-temporal data along a
temporal dimension, locate features in these slices, observe their
motion over time, and use this motion to determine the
three-dimensional feature locations. This temporal slicing, into what
we call epipolar-plane images, or EPIs, can be done whenever the
camera moves in a straight line (the direction and complexity of
the slicing varies with the camera parameters). We discuss the
geometrical considerations of this approach, point out its major
strengths and weaknesses, and present the mechanism for generalizing
the analysis to an unrestricted range of camera viewing
directions along a linear, and possibly non-linear, path.


Fig. 1. Spatio-temporal volume of data.


*This research was supported by DARPA Contracts MDA 903-86-C-0084 and DACA 76-85-C-0004.


2:  Analysis  in  the  Epipolar  Plane 

We model the camera as a pin-hole with image plane in
front of the lens (Figure 2). For each feature P in the scene and
two viewing positions Vi and Vj, there is an epipolar plane, which
passes through P and the line joining the two lens centers. This
plane intersects¹ the two image planes along corresponding epipolar
lines. An epipole is the intersection of an image plane with
the line joining the lens centers. In motion analysis, an epipole
is often referred to as the focus of expansion (FOE) because the
epipolar lines radiate from it. In our work, the camera moves
in a straight line, and the lens centers at the various viewing
positions lie along this line. Here, the FOE is the camera path.
This structuring divides the scene into a pencil of planes passing
through the camera path. We view this as a cylindrical coordinate
system with axis (H) the camera path, angle (θ) defined
by the epipolar plane, and radius (R) the distance of the feature
from the axis. Note that a scene feature is restricted to a single
epipolar plane, and any scene features at the same angle (within
the discretization) share that epipolar plane. This means that
the analysis of a scene can be partitioned into a set of analyses,
one for each EPI, and these EPIs can be processed independently
and in parallel.
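
The cylindrical partitioning can be sketched in a few lines. The example below assumes, purely for illustration, that the camera path is the x axis of the scene coordinate frame and that the angle is discretized into equal bins; the function names and the bin layout are hypothetical.

```python
import math

def cylindrical(point):
    """(x, y, z) -> (H, theta, R) with the camera path taken as the x axis."""
    x, y, z = point
    return x, math.atan2(y, z), math.hypot(y, z)

def epi_index(point, n_planes, theta_min, theta_max):
    """Index of the epipolar plane (angular bin) that contains `point`."""
    _, theta, _ = cylindrical(point)
    fraction = (theta - theta_min) / (theta_max - theta_min)
    return min(n_planes - 1, max(0, int(fraction * n_planes)))

# Two features at the same angle about the path share an epipolar plane,
# however far apart they are along the path or in radius.
near = epi_index((0.0, 1.0, 1.0), 10, -math.pi / 2, math.pi / 2)
far = epi_index((10.0, 3.0, 3.0), 10, -math.pi / 2, math.pi / 2)
```

Because the bin depends only on the angle, features can be grouped by bin and each group analysed separately, which is the independence the text relies on for parallel processing.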


Figure 3 shows a simple motion with a camera moving
orthogonal to its direction of view. Here, the epipolar lines for
a feature, such as P, are horizontal scanlines, and these occur
at the same vertical position (scanline) in all the images. Each
scanline is a projected¹ observation of the features in the epipolar
plane. The projection of P onto these epipolar lines moves to the
right as the camera moves to the left. For a constant camera
motion, the velocity of this movement along the epipolar line is
a function of P's distance from the line joining the lens centers.
The closer the feature, the greater is its motion. For this type of
lateral motion the trajectories are straight lines (see Adelson [1],
Bridwell [4] and Yamamoto [13] for some simple observations on
this structuring that predate our use of EPIs). Figure 4 shows
a set of epipolar lines (here, scanlines) made into an image (an
EPI). Each horizontal line of the EPI is one image scanline. Time
progresses from bottom to top, and, as the camera moves to the
left, the features move to the right.
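
For this lateral-motion case, building an EPI amounts to stacking the same scanline from every frame. The sketch below assumes the image sequence is held as nested lists indexed `volume[t][row][col]`; that layout and the helper names are illustrative, not the authors' implementation.

```python
def epipolar_plane_image(volume, row):
    """Stack scanline `row` from every frame: one EPI row per time step."""
    return [frame[row] for frame in volume]

def track_feature(epi, value):
    """Column of the first pixel equal to `value` in each row of the EPI."""
    return [scanline.index(value) for scanline in epi]

# Synthetic sequence: 4 frames of 3 x 8 pixels with one bright pixel on
# scanline 1 that moves one column to the right per frame.
volume = []
for t in range(4):
    frame = [[0] * 8 for _ in range(3)]
    frame[1][2 + t] = 9
    volume.append(frame)

epi = epipolar_plane_image(volume, row=1)
path = track_feature(epi, 9)
```

Here the first list element is the earliest frame, whereas Figure 4 draws time from bottom to top. The tracked columns form a straight line, as expected for lateral motion.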


¹ Here, intersection and projection are equivalent.


Fig.  4.  Epipolar-plane  image  from  right-to-left  motion. 


There  are  several  things  to  notice  about  this  image.  First, 
it  contains  only  linear  structures.  Second,  the  slopes  of  the 
lines  determine  the  distances  to  the  corresponding  features  in 
the  world  -  the  greater  the  slope,  the  farther  the  feature.  Third, 
occlusion,  which  occurs  when  a  closer  feature  moves  in  front  of 
a  more  distant  one,  is  immediately  apparent  in  this  representa¬ 
tion.  These  suggest  the  approach  we  should  take  in  analyzing  the 
data:  find  the  linear  structures  (feature  paths)  in  the  EPIs;  join 
together  paths  that  are  collinear,  although  broken  by  occlusion; 
use  the  slopes  of  the  linear  paths  to  determine  scene  depth. 
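
The relation between path slope and depth can be sketched for the lateral-motion case. The example below assumes an ideal pinhole model in which a feature at lateral position X and depth Z projects to u = f * (X - s) / Z when the camera is at position s along its path; the focal length f, the sign convention, and the function names are assumptions made for illustration.

```python
def fit_line(ss, us):
    """Least-squares fit u = slope * s + intercept."""
    n = len(ss)
    mean_s, mean_u = sum(ss) / n, sum(us) / n
    slope = (sum((s - mean_s) * (u - mean_u) for s, u in zip(ss, us))
             / sum((s - mean_s) ** 2 for s in ss))
    return slope, mean_u - slope * mean_s

def depth_from_path(camera_positions, image_us, focal_length):
    """Lateral position X and depth Z of a feature from its linear EPI path."""
    slope, intercept = fit_line(camera_positions, image_us)  # du/ds = -f / Z
    Z = -focal_length / slope
    X = intercept * Z / focal_length  # at s = 0 the image position is f * X / Z
    return X, Z

# Synthetic feature at X = 2, Z = 5 seen from four camera positions, f = 1.
positions = [0.0, 1.0, 2.0, 3.0]
observed = [(2.0 - s) / 5.0 for s in positions]
X_est, Z_est = depth_from_path(positions, observed, focal_length=1.0)
```

In an EPI drawn with time on the vertical axis, the slope is the reciprocal of du/ds, so a distant feature (small image motion per camera step) appears as a steep line, matching the second observation above.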

The  EPI  in  Figure  4  was  constructed  from  a  simple  right- 
to-left  motion  with  the  camera  oriented  at  right  angles  to  the 
path.  As  long  as  the  lens  center  of  the  camera  moves  in  a  straight 
line  the  epipolar  planes  remain  fixed  relative  to  the  scene  (as  in 
Figure  2).  The  camera  can  even  change  its  orientation  about 
its  lens  center  as  it  moves  along  the  line  without  affecting  this 
partitioning  of  the  scene.  Orientation  changes  move  the  epipolar 
lines around in the image plane, significantly complicating² the
construction of the EPIs, but the epipolar planes remain fixed.
Figure 5a is an EPI formed from a sequence of images taken by
a camera looking straight ahead as it moves; notice that the
feature paths are hyperbolic. Allowing the camera to change
orientation smoothly as it moves along gives rise to EPIs with
curves such as those in Figure 5b.

These  non-orthogonal  viewing  directions  would  seem  to 
make  our  tracking  problem  non-linear.  To  maintain  the  linearity 
regardless  of  viewing  direction,  we  chose  a  representation  that 
leads  to  an  approach  different  from  that  outlined  above.  We  do 
not  find  linear  feature  paths  in  the  EPIs;  rather,  we  find  linear 
paths  in  a  dual  space.  Our  insight  here  (see  Marimont  [8]  and  [9]) 
is  that  no  matte'  where  a  can.  era  roams  about  a  scene,  for  any 
particular  feature,  the  lines  of  sight  from  the  camera  principal 
point  through  that  feature  in  space  (determined  by  the  line  from 
the  principal  point  through  the  point  in  the  image  plane  where 
the  projected  feature  is  observed)  all  intersect  at  the  feature  (ig¬ 
noring  measurement  error).  The  duals  of  these  lines  of  sight  lie 
along  a  line  whose  dual  is  the  scene  point:  fitting  a  point  to  the 
lines  of  sight  is  a  linear  problem. 
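
To make the linearity concrete, the following sketch fits a point to a set of lines of sight within a single epipolar plane. It solves the least-squares problem directly in scene coordinates rather than in the dual space used by Marimont [8], [9]; the function name and the 2-D parameterization (camera position plus direction vector) are assumptions made for this example.

```python
import math


def fit_point_to_sightlines(cameras, directions):
    """Least-squares point closest to a set of 2-D lines of sight.

    Each line passes through cameras[i] along directions[i]; a point p on
    it satisfies n_i . p = n_i . c_i, with n_i the unit normal to the line.
    Stacking these gives a linear system, solved via 2x2 normal equations.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (cx, cy), (dx, dy) in zip(cameras, directions):
        norm = math.hypot(dx, dy)
        nx, ny = -dy / norm, dx / norm   # unit normal to the line of sight
        rhs = nx * cx + ny * cy          # n . c
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * rhs
        b2 += ny * rhs
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines of sight are parallel; point is undetermined")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Camera moves along the x-axis; every line of sight passes through (2, 5).
feature = (2.0, 5.0)
cameras = [(float(x), 0.0) for x in range(-3, 4)]
directions = [(feature[0] - cx, feature[1] - cy) for cx, cy in cameras]
estimate = fit_point_to_sightlines(cameras, directions)
```

Nothing in the system depends on the direction in which the camera is pointing, only on where it is and where the feature is seen, which is the property the text relies on.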


2Notice that only in the case of orthogonal viewing will the intersections
of epipolar planes with image planes be parallel and feature paths be straight
lines.




3:  Generalization 


Fig.  5a.  EPI  with  forward  view  direction. 


Fig.  5b.  EPI  with  varying  view  direction. 


Capitalizing  on  these  observations,  our  analysis  of  feature 
motion  in  the  epipolar  planes  consisted  of  the  following  steps: 

1.  3D  convolution  of  the  spatio-temporal  data 
2.  Slicing the convolved data into EPIs

3.  Detecting  edges  in  the  EPIs  and  converting  them  to  lines 
of  sight 

4.  Breaking these lines of sight into linear segments (in dual
space)

5.  Merging collinear segments

6.  Computing  x-y-z  coordinates 

7.  Building  a  map  of  free  space 

8.  Linking  x-y-z  points  between  EPIs 
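
Step 2 of this list is simple to sketch for the orthogonal-viewing case, where each scanline lies in one epipolar plane. The nested-list layout `volume[t][v][u]` and the function name are assumptions for illustration; the convolution of step 1 is taken as already applied.

```python
def slice_into_epis(volume):
    """Reorganise a sequence of images volume[t][v][u] into EPIs epi[v][t][u].

    Assumes lateral motion with orthogonal viewing, so that each scanline v
    lies in a single epipolar plane. Rows are shared with the input, not copied.
    """
    n_frames = len(volume)
    n_rows = len(volume[0])
    return [[volume[t][v] for t in range(n_frames)] for v in range(n_rows)]


# Three 2x4 frames in which a bright pixel on row 1 drifts one column per frame.
frames = [[[0] * 4 for _ in range(2)] for _ in range(3)]
for t in range(3):
    frames[t][1][t] = 255
epis = slice_into_epis(frames)
```

In `epis[1]` the drifting pixel appears as a diagonal streak, the discrete counterpart of a linear feature path.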

Our  first  implementation  was  described  in  Bolles  [2],  and 
the  current  (as  above)  implementation  is  discussed  and  demon¬ 
strated  in  Bolles  [3]. 

The  following  should  be  noted  about  this  analysis: 

•  There  are  some  obvious  ways  to  make  it  incremental  in 
time, and partitionable in v (epipolar planes), for high
speed  performance; 

•  Although, in principle, the camera’s view direction is unrestricted,
its motion must be in a straight line3;

•  Frame  rate  must  be  high  enough  to  limit  the  frame-to-frame 
changes  to  a  pixel  or  so  (more  specifically,  such  that  the 
motion  of  a  surface  is  less  than  its  projected  width); 

•  Independently  moving  objects  will  either  not  be  detected, 
or  will  be  detected  inaccurately; 

•  Curved objects may either be viewed as polygonal approximations,
or appear as moving objects (see [3]).


3If the lens center does not move in a straight line, the epipolar planes
passing through a world point differ from one camera position to the next.
The points in the scene are grouped one way for the first and second camera
positions, a different way for the second and third, and so on. This makes
it impossible to partition the scene into a fixed set of planes, which in turn
means that it is not possible to construct EPIs for such a motion. The
arrangement of epipolar lines between images must be transitive for EPIs to
be formed.


Although  demonstrating  the  feasibility  of  this  novel  ap¬ 
proach  to  depth  analysis,  the  implementation  described  above 
had  several  significant  limitations: 

1.  While  the  theory  placed  no  restriction  on  camera  view  di¬ 
rection,  the  implementation  could  not  handle  any  motions 
other  than  those  along  a  linear  path  where  the  view  direc¬ 
tion  was  fixed  and  perpendicular  to  the  motion.  It  could  not 
handle  the  obvious  cases  of  view  in  the  direction  of  motion, 
or  view  changing  with  motion. 

2.  Working  with  epipolar  planes  separately,  there  was  no  sat¬ 
isfactory  way  to  combine  results  between  planes,  or,  of 
equal  importance,  to  share  information  between  nearby 
planes  in  producing  the  results.  This  led  to  fragmentation 
of the depth results, and a lack of three-dimensional
coherence4  for  both  depth  results  and  free  space  (see  [3]). 

3.  The  processing  was  structured  to  work  on  an  entire  image 
sequence  at  one  time,  requiring  that  all  images  be  taken 
before  analysis  was  begun. 

4.  There  was  no  way,  within  the  framework  developed,  to  re¬ 
move  the  restriction  to  a  linear  camera  path. 


These restrictions severely limit the practical usefulness of
the technique for real tasks of navigation or 3-D sensing. We
have been examining these limitations and developing new
techniques to remove them. In fact, the mechanisms for their
removal all rest on a single restructuring of the analysis.

We  collect  the  data  as  a  sequence  of  images.  As  each  new 
image  is  acquired,  we  construct  its  spatial  and  temporal  edge  con¬ 
tours.  These  contours  are  three-dimensional  zeros  of  the  Lapla- 
cian  of  a  chosen  three-dimensional  Gaussian5  (LoG),  and  the 
construction  produces  a  spatio-temporal  surface  enveloping  the 
positive  (or  negative)  volumes  (in  two-dimensions,  edge  contours 
envelop  positive  or  negative  regions).  The  spatial  connectivity 
in  this  structure  lets  us  explicitly  maintain  object  coherence  be¬ 
tween  features  observed  on  separate  epipolar  planes  (the  second 
limitation  above);  the  temporal  connectivity  gives  us,  as  before, 
the  tracking  of  features  over  time,  and  therefore,  for  known  cam¬ 
era  parameters,  the  positions  of  those  features  in  space.  We 
can  use  the  spatial  connectivity  after  processing,  to  form  con¬ 
nected  feature  descriptions,  and  also  during  the  analysis,  in  giving 
us  greater  support  for  determining  our  estimates  (incorporating 
weighted  observations  from  adjacent  epipolar  planes)  and  letting 
us  associate  features  that  appear  disjoint  in  the  epipolar  plane  in 
which they occur, but lie on a common surface (i.e., occlusion is
not  complete  over  all  time  and  space,  and  there  is  some  perspec¬ 
tive  from  which  the  features  are  not  disjoint). 
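
A voxel-level sketch of how such a surface can be located is given below. It takes a volume that has already been convolved with the LoG and reports the facets between neighbouring voxels of opposite sign; the interpolation of the surface position and the connectivity bookkeeping described above are omitted, and the function name and data layout are assumptions.

```python
def zero_crossing_facets(log_volume):
    """List facets separating voxels of opposite sign in log_volume[t][v][u].

    Each facet is reported as the pair of adjacent voxel indices it separates.
    """
    nt, nv, nu = len(log_volume), len(log_volume[0]), len(log_volume[0][0])
    facets = []
    for t in range(nt):
        for v in range(nv):
            for u in range(nu):
                here = log_volume[t][v][u]
                # Visit only the +t, +v, +u neighbours so each facet appears once.
                for dt, dv, du in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    t2, v2, u2 = t + dt, v + dv, u + du
                    if t2 < nt and v2 < nv and u2 < nu:
                        there = log_volume[t2][v2][u2]
                        if (here > 0) != (there > 0):
                            facets.append(((t, v, u), (t2, v2, u2)))
    return facets


# A 2x2x2 volume that is positive for u = 0 and negative for u = 1.
volume = [[[1.0, -1.0], [1.0, -1.0]], [[1.0, -1.0], [1.0, -1.0]]]
facets = zero_crossing_facets(volume)
```

Facets that separate voxels adjacent in t carry the temporal connectivity, while those adjacent in u or v carry the spatial connectivity.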

In  this  spatio-temporal  surface  description,  feature  obser¬ 
vations  bear  (u,v,t)  coordinates,  and  are  spatio-temporal  voxel 
facets.  Figure  7  shows  a  mesh  description  of  the  facets  for  the 
spatio-temporal  surface  associated  with  the  forward-viewing  se¬ 
quence  whose  first  and  last  images6  are  depicted  in  Figure  6  - 
surfaces  are  interpolated  frontiers  separating  the  positive  and 
negative  LoG  voxels.  Now,  for  non-orthogonal  camera  viewing  di¬ 
rections,  epipolar  lines  are  not  scanlines  (i.e.  the  epipolar  planes 
are  not  distinguished  by  the  v  coordinate).  To  obtain  this  nec¬ 
essary  structuring,  we  develop  within  this  spatio-temporal  de¬ 
scription  an  embedded  representation  that  makes  the  epipolar 


4We  aren’t  alone  in  this  -  no  other  depth  sensing  technique  produces  this 
explicit  coherence  either. 

5See also Buxton [5] and Heeger [7] for their use of spatio-temporal
convolution over an image sequence.

6The surface representations shown here are based, for clarity and development,
on a reduced version of the imagery - one eighth the linear resolution
of  the  originals. 




Fig.  10.  Single  spatio-temporal  surface  from  top  left  of  Figure  7. 


Fig.  11.  Epipolar-plane  representation  for  surface  of  Figure  10. 


organization  explicit.  Over  each  of  the  sequential  images,  we 
transform  the  (u,v,t)  coordinates  of  our  spatio-temporal  zeros 
to $(r, h, \theta)$ cylindrical coordinates ($\theta$ indicates the epipolar-plane
angle ($\theta \in [0, 2\pi]$), the quantized resolution in $\theta$ is a supplied
parameter, and the transform for each image is determined by the
particular  camera  parameters).  In  this  new  coordinate  system, 
we  build  a  structure  very  much  like  our  earlier  epipolar-plane 
edge  contours,  but  organized  by  epipolar  plane.  This  is  done  by 
intersecting the spatio-temporal surfaces with the pencil of appropriate
epipolar planes7 (as in Figure 2). We weave the epipolar
connectivity  through  the  spatio-temporal  volume,  following  the 
known  camera  viewing  direction  changes.  Figure  8  shows  a  sam¬ 
pling  of  the  spatio-temporal  surfaces  as  they  intersect  the  pencil 
of  epipolar  planes  (every  fifth  plane  is  depicted).  Figure  9  shows 
some  of  these  surface-plane  intersections,  along  with  the  associ¬ 
ated  bounding  planes  (refer  to  Figure  2).  Figure  10  isolates  a 
single surface from the top left of Figure 7, and shows its spatio-
temporal structure. Figure 11 shows the same surface structured
by  its  epipolar-plane  components. 
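
One way to picture the angular part of this transform is sketched below. It assumes that the line of sight for an observation has already been expressed as a 3-D direction in scene coordinates (from its (u, v) position and the camera parameters for that frame), and assigns it to one of `n_planes` quantized epipolar planes about the camera path. The function and argument names are hypothetical, and the r and h coordinates are not computed.

```python
import math


def epipolar_plane_index(sight_dir, path_dir, n_planes):
    """Quantised epipolar-plane angle theta in [0, 2*pi) for a line of sight.

    The epipolar planes form a pencil about the camera path, so the plane
    containing a line of sight is fixed by the angle of that line's component
    perpendicular to the path direction.
    """
    ax, ay, az = path_dir
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    ax, ay, az = ax / norm, ay / norm, az / norm
    # Build an orthonormal basis (e1, e2) for the plane perpendicular to the path.
    hx, hy, hz = (0.0, 0.0, 1.0) if abs(az) < 0.9 else (1.0, 0.0, 0.0)
    dot = hx * ax + hy * ay + hz * az
    e1 = (hx - dot * ax, hy - dot * ay, hz - dot * az)
    n1 = math.sqrt(sum(c * c for c in e1))
    e1 = tuple(c / n1 for c in e1)
    e2 = (ay * e1[2] - az * e1[1], az * e1[0] - ax * e1[2], ax * e1[1] - ay * e1[0])
    sx, sy, sz = sight_dir
    x = sx * e1[0] + sy * e1[1] + sz * e1[2]
    y = sx * e2[0] + sy * e2[1] + sz * e2[2]
    theta = math.atan2(y, x) % (2.0 * math.pi)
    return theta, int(theta / (2.0 * math.pi) * n_planes) % n_planes


# Two lines of sight that differ only in their component along the path
# fall in the same epipolar plane.
theta_a, plane_a = epipolar_plane_index((5.0, 0.3, 2.0), (1.0, 0.0, 0.0), 360)
theta_b, plane_b = epipolar_plane_index((-3.0, 0.3, 2.0), (1.0, 0.0, 0.0), 360)
```

Because the angle ignores the component of the line of sight along the path, the assignment is unaffected by how the camera is oriented, provided the orientation is known.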


By  maintaining  the  epipolar  structure  irrespective  of  cam¬ 
era  attitude,  we  have  provided  a  remedy  to  the  first  limitation 
cited  above  (non-orthogonal  and  varying  viewing  directions).  In 
these  planes,  feature  paths  will  be,  in  general,  arbitrary  curves 
in $(u, v, t)$ space, but lines in the space of lines of sight.

7Notice that, with view direction varying, the underlying epipolar plane
may be far from planar in $(u, v, t)$ space - it may undulate in a manner similar
to  that  in  which  Figure  5b  varies  from  Figure  4,  and  for  similar  reasons. 


The  fact  that  the  analysis  is  performed  as  images  are  ac¬ 
quired  means  that  the  processing  may  be  incremental  in  time, 
and  this  removes  the  requirement  mentioned  in  limitation  3 
where,  earlier,  the  entire  sequence  had  to  be  obtained  before 
processing  could  be  begun.  The  epipolar  re-structuring  is  also 
incremental  in  time.  We  are  still  developing  the  incremental  fea¬ 
ture  estimation. 

A  crucial  constraint  of  the  current  epipolar-plane  image 
analysis  is  that  having  a  camera  moving  along  a  linear  path  en¬ 
ables  us  to  divide  the  analysis  into  planes,  in  fact,  the  pencil  of 
planes  that  passes  through  the  camera  path.  With  this,  we  are 
assured  that  a  feature  will  be  viewed  in  just  a  single  one  of  these 
planes,  and  its  motion  over  time  will  be  confined  to  that  plane. 
Another  crucial  constraint  is  the  one  we  generalized  from  the 
orthogonal  viewing  case  -  we  know  that  the  set  of  line-of-sight 
vectors  from  camera  to  feature  over  time  will  all  intersect  at  that 
feature,  and  determining  that  feature’s  position  is  a  linear  prob¬ 
lem.  This  latter  constraint  does  not  depend  upon  the  former.  In 
fact,  the  problem  would  remain  linear  even  if  the  camera  mean¬ 
dered  all  over  the  scene.  This  knowledge  gives  us  a  possibility 
of  resolving  the  fourth  limitation  above  -  that  the  camera  path 
be  linear.  All  that  the  linear  path  guarantees  is  that  the  prob¬ 
lem  is  divisible  into  epipolar  planes.  If  we  lose  this  constraint, 
then  we  cannot  restrict  our  feature  tracking  to  separate  planes. 
The  features  will,  however,  still  form  linear  paths  in  the  space  of 
line-of-sight  vectors,  and  our  spatio-temporal  surface  description 
is  an  appropriate  representation  for  doing  this  non-planar,  but 
still  linear,  tracking.  The  motion  of  features  will  give  us  ruled 


surfaces, with the rules (zeros of Gaussian curvature) revealing
the  positions  of  the  features  in  space8.  This  generality  suggests 
that  there  is  even  broader  application  for  the  technique  than  we 
had  initially  thought.9 
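
The claim that the estimation stays linear for an arbitrary camera path can be checked with a small sketch. It finds the 3-D point nearest, in the least-squares sense, to a set of lines of sight taken from camera positions that do not lie on a line. This is a direct formulation under the assumption that camera positions and sight directions are known; it is not the ruled-surface tracking proposed above, and all names are illustrative.

```python
import math


def solve3(m, b):
    """Solve the 3x3 system m x = b by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("lines of sight do not constrain a unique point")
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det(replaced) / d)
    return solution


def intersect_sightlines(centers, directions):
    """Least-squares 3-D point nearest to lines through centers[i] along directions[i]."""
    m = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = math.sqrt(sum(x * x for x in d))
        u = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                proj = (1.0 if i == j else 0.0) - u[i] * u[j]  # I - u u^T
                m[i][j] += proj
                b[i] += proj * c[j]
    return solve3(m, b)


# A camera wandering on a non-linear path while observing the point (1, 2, 10).
target = (1.0, 2.0, 10.0)
centers = [(0.0, 0.0, 0.0), (1.5, 0.2, 0.5), (2.0, 1.5, -0.3), (0.5, 3.0, 1.0)]
directions = [tuple(t - c for t, c in zip(target, ctr)) for ctr in centers]
point = intersect_sightlines(centers, directions)
```

What is lost without the linear path is only the partition into planes: the correspondence of observations to a feature must then be tracked through the full spatio-temporal volume.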

5:  Conclusions 

We  showed,  in  our  earlier  work,  the  feasibility  of  extracting 
scene  depth  information  through  Epipolar-Plane  Image  Analysis. 
In  this,  our  results  in  determining  scene  depth  are  distinctive  in 
that  they: 

•  bridge  the  dichotomy  between  accuracy  and  ease  of  match¬ 
ing; 

•  through  the  use  of  an  increasing  baseline,  give  highly  accu¬ 
rate  estimates  of  feature  position,  carrying  associated  mea¬ 
sures  of  that  accuracy  for  each  feature  tracked; 

•  both  indicate  occluding  contours,  and  use  them  in  collect¬ 
ing  observations  to  improve  feature  estimates; 

•  provide  not  only  feature  3-D  position  measures,  but  also 
information  about  scene  free  space  -  information  which  is 
traditionally  quite  difficult  to  acquire; 

Our  theory  applies  for  any  motion  where  the  camera  (lens 
center)  moves  in  a  straight  line,  and  the  reported  implementa¬ 
tions  covered  the  case  of  viewing  direction  orthogonal  to  the  cam¬ 
era  path.  The  generalizations  obtained  through  spatio-temporal 
surface  analysis,  demonstrated  here  and  currently  being  refined, 
bring  us  the  advantages  of: 

•  incremental  analysis; 

•  unrestricted  viewing  direction  (including  direction  varying 
along  the  path); 

•  spatial coherence in our results, providing connected surface
information for scene objects, rather than point estimates
structured by epipolar plane;

•  suggesting  it  may  be  possible  to  remove  the  restriction  that 
fixes  us  to  a  linear  path. 


8Visualize pick-up-sticks jammed in a box, with the sticks being the rules.
9Although computationally expensive, it would be possible here
to use the pairwise epipolar constraints between images to constrain rule
tracking on the spatio-temporal surface.


References 

[1]  Adelson, Edward H., and James R. Bergen, “Spatiotemporal
energy  models  for  the  perception  of  motion,”  Journal  of  the 
Optical  Society  of  America  A,  2:2  (1985),  284-299. 

[2]  Bolles,  R.  C.,  and  H.  H.  Baker,  “Epipolar-Plane  Image  Anal¬ 
ysis:  A  Technique  for  Analyzing  Motion  Sequences,”  Pro¬ 
ceedings:  DARPA  Image  Understanding  Workshop,  Miami 
Beach,  Florida,  (December  1985)  137-148. 

[3]  Bolles,  Robert  C.,  H.  Harlyn  Baker,  and  David  H.  Marimont, 
“Epipolar-Plane  Image  Analysis:  An  Approach  to  Deter¬ 
mining  Structure  from  Motion,”  premier  issue  of  The  In¬ 
ternational  Journal  of  Computer  Vision,  Kluwer  Academic 
Publishers,  to  appear,  January  1987. 

[4]  Bridwell,  Nelson  J.,  and  Thomas  S.  Huang,  “A  Discrete  Spa¬ 
tial  Representation  for  Lateral  Motion  Stereo,”  Computer 
Vision,  Graphics,  and  Image  Processing,  21  (1983),  33-57. 

[5]  Buxton,  B.F.,  and  Hilary  Buxton,  “Monocular  Depth  Per¬ 
ception  from  Optical  Flow  by  Space  Time  Signal  Process¬ 
ing,”  Proceedings  of  the  Royal  Society  of  London  Series  B 
218  (1983),  27-47. 

[6]  Buxton,  B.F.,  and  H.  Buxton,  “Computation  of  Optic  Flow 
from  the  Motion  of  Edge  Features  in  Image  Sequences,”  Im¬ 
age  and  Vision  Computing,  2:2  (1984). 

[7]  Heeger,  David  J.,  “Depth  and  Flow  from  Motion  Energy,” 
Proceedings  of  the  Fifth  National  Conference  on  Artificial 
Intelligence  (AAAI-86),  (Philadelphia,  August  1986),  657- 
663. 

[8]  Marimont,  David  H.,  “Inferring  Spatial  Structure  from  Fea¬ 
ture  Correspondences,”  Ph.D.  dissertation,  Stanford  Uni¬ 
versity,  1986. 

[9]  Marimont,  David  H.,  “Projective  Duality  and  the  Analysis 
of  Image  Sequences,”  Proceedings  of  the  Workshop  on  Mo¬ 
tion:  Representation  and  Analysis,  IEEE  Computer  Society, 
Kiawah  Island,  South  Carolina,  (May  1986),  7-14. 

[10]  Moravec,  Hans  P.,  “Visual  Mapping  by  a  Robot  Rover,”  Pro¬ 
ceedings  of  the  International  Joint  Conference  on  Artificial 
Intelligence,  (Tokyo,  August  1979),  598-600. 

[11]  Roach, J.W., and J.K. Aggarwal, “Determining the movement
of objects from a sequence of images,” IEEE Transactions
on Pattern Analysis and Machine Intelligence PAMI-2:6
(1980), 554-562.

[12]  Waxman, Allen M., and Kwangyoen Wohn, “Contour Evolution,
Neighborhood Deformation, and Global Image Flow:
Planar  Surfaces  in  Motion,”  The  International  Journal  of 
Robotics  Research  4:3  (1985),  95-108. 

[13]  Yamamoto,  M.,  “Motion  Analysis  Using  the  Visualized  Lo¬ 
cus  Method,”  untranslated  Japanese  articles,  1981. 




Detecting  Dotted  Lines  and  Curves  in  Random-Dot  Patterns* 


Richard  Vistnes 

AI  Lab,  Stanford  University,  Stanford,  California  94305 


Abstract 

Results  are  reported  of  psychophysical  experiments 
that  measured  human  performance  in  detecting  dot¬ 
ted  lines  and  curves  embedded  in  a  random-dot  back¬ 
ground.  As  pattern  density  increases,  more  point  scat¬ 
ter  and  curvature  is  tolerated.  A  statistical  model  for 
the  detection  of  non-accidental  patterns  such  as  lines 
and  curves  is  presented;  the  model  is  based  on  center- 
surround  operators  that  find  significant  differences  in 
dot density in an elongated region as compared to the
local  surround.  Performance  predicted  by  the  model 
compares  well  with  experimental  results,  as  do  results 
of  computer  simulations. 

1  Introduction  and  Motivation 

This  paper  examines  one  aspect  of  human  perceptual 
organization  of  images:  the  preattentive  grouping  of 
image  elements  into  lines  and  curves.  I  report  the  re¬ 
sults  of  a  series  of  experiments  in  which  subjects  tried 
to  detect  dotted  line  segments  and  dotted  circular  arcs 
surrounded  by  randomly-placed,  identical  dots.  I  de¬ 
velop  a  theory  of  dotted  line  and  curve  detection,  from 
which  predictions  are  made  and  compared  to  the  exper¬ 
imental  data.  A  computer  simulation,  and  the  char¬ 
acteristics  of  its  performance,  offers  another  perspec¬ 
tive  on  the  mechanisms  underlying  the  perceptual  phe¬ 
nomenon. 

The  motivations  for  performing  these  experiments 
are  twofold.  First,  current  notions  of  edge  detection 
in  computer  vision  systems  involve  detecting  local  dis¬ 
continuities  in  the  image,  followed  by  linking  these 
“edgels” into smooth curves [1,11]. Zucker [19,22,23]
has  studied  the  linking  of  edgels  using  only  the  geomet¬ 
rical  information  in  the  edgel  positions.  This  removes 
information  about  contrast,  orientation,  size  and  other 
factors,  and  simplifies  the  problem;  I  follow  the  same 
approach  here.  One  motivation  for  studying  human 
performance  is  to  discover  new  algorithms  that  can  be 
used  in  computer  vision  systems. 

*This is an abridged version of a paper submitted to Journal
of the Optical Society of America, Series A.


Another  motivation  is  to  clarify  the  nature  of  human 
preattentive  processing.  The  grouping  problem  studied 
here  deserves  attention,  since  humans  can  easily  recog¬ 
nize  objects  depicted  in  briefly-presented  line  drawings 
[2].  My  own  informal  demonstrations  show  that  people 
can  recognize  briefly-presented  dotted-line  drawings  of 
objects  as  well,  even  when  those  dotted  lines  are  em¬ 
bedded  in  a  noisy  background.  Since  only  preattentive 
processes  are  at  work  in  these  recognition  tasks,  we 
conclude  that  the  information  obtained  from  them  is 
sufficient for recognition, at least for the class of shapes
used  in  the  experiments.  This  paper  concentrates  on 
aspects  of  preattentive  line  and  curve  detection.  Oth¬ 
ers,  notably  Julesz  [7]  and  Treisman  [13],  have  studied 
other  aspects  of  preattentive  perception. 

1.1  Definitions 

How  shall  we  define  a  dotted  line?  Restricting  the 
definition  to  straight  lines  of  dots  is  pointless,  since 
a  slightly  jagged  dotted  line  does  not  satisfy  this  def¬ 
inition,  yet  we  still  perceive  it  as  a  line.  Consider  the 
dotted  lines  shown  in  Fig.  1.  We  certainly  consider 
the  structure  at  the  top  to  be  a  line,  and  probably  the 
one  in  the  middle  too.  But  what  about  the  one  at  the 
bottom,  with  its  distantly-spaced  dots?  It  is  not  clear 
whether  we  should  call  this  structure  a  line  or  not;  that 
decision  must  depend  on  the  context  and  on  the  goals 
of  the  vision  system.  If  this  bottom  structure  is  to  be 
considered  a  line,  it  is  not  clear  where  its  endpoints  are, 
nor  which  dots  it  contains. 

Thus,  it  can  be  misleading  to  decide  whether  or  not 
an  image  structure  constitutes  a  line.  Imagine  drawing 
the  dots  in  the  bottommost  structure  in  Fig.  1  closer 
and  closer  together:  the  closer  the  dots,  the  stronger 
the  impression  of  a  line.  There  is  no  definite  point  at 
which  we  call  more  closely-spaced  dots  a  line,  but  more 
distantly-spaced  dots  not  a  line.  Furthermore,  the  sig¬ 
nificance  of  a  pattern  depends  to  a  great  extent  on  the 
nearby  image.  The  bottom  line  in  Fig.  1  is  unnoticeable 
when  surrounded  by  other  dots  as  it  is;  if  the  dotted 
surround  were  removed  or  reduced  in  density,  this  line 
would  become  easily  detectable. 




Figure  1.  Dotted  lines  of  different  significance.  The  line 
at  the  top  is  very  significant;  the  one  in  the  middle  is  less 
significant. At the bottom the line is marginally significant.

It  is  more  useful  to  assign  a  significance  level  to  a 
structure.  It  is  possible  to  determine  the  statistical 
significance  of  a  pattern:  how  likely  it  is  to  have  arisen 
at  random,  given  the  nature  of  the  image  nearby.  If  it 
is  necessary  to  decide  whether  a  structure  constitutes 
a  line  or  not,  then  the  significance  level  can  be  used: 
e.g.,  if  there  is  only  a  1%  chance  that  the  structure 
would  arise  by  chance,  given  the  nearby  image,  then  it 
might  be  useful  to  make  that  signal-to-symbol  decision. 
We  should  not  arbitrarily  set  thresholds  for  determin¬ 
ing  what  structures  we  consider  to  be  edges,  lines  and 
curves:  Binford  [4,3]  makes  the  same  argument.  For 
some  purposes  it  may  be  useful  to  ignore  patterns  that 
do  not  exceed  some  level  of  significance.  But  in  general 
it  is  reasonable  to  define  only  a  class  of  patterns  for 
which  to  search  in  the  image,  and  a  function  defined 
on  such  patterns  that  determines  their  significance. 

For  the  purposes  of  this  paper,  it  will  be  sufficient 
to  define  a  line  segment  in  a  (dotted)  image  as  a  nar¬ 
row  rectangular  region  in  which  the  density  of  dots  is 
significantly  higher  than  in  the  local  background.  Note 
that  this  definition  of  line  segment  treats  clusters  of 
dots  the  same  as  collinear  dots.  For  the  purpose  of 
detecting  line  segments,  however,  the  confusion  of  den¬ 
sity  with  collinearity  implied  in  the  current  definition 
should  cause  no  problems  in  any  real  vision  system. 
Furthermore,  the  implications  of  such  a  definition  seem 
to  agree  with  human  psychophysics,  as  we  shall  see. 
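
The notion of significance used in this definition can be made concrete with a short sketch. Assuming the surround dots follow a Poisson distribution, it computes the probability that a narrow rectangle would contain at least the observed number of dots by chance. This is only an illustration of the definition, not the center-surround model developed later in the paper, and the function names and the example strip width are assumptions.

```python
import math


def poisson_tail(k, mean):
    """P(N >= k) for a Poisson-distributed count N with the given mean."""
    term = math.exp(-mean)      # P(N = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= mean / (i + 1)  # P(N = i + 1)
    return max(0.0, 1.0 - below)


def line_significance(n_dots, length, width, surround_spacing):
    """Chance probability of seeing n_dots or more in a length x width strip.

    Background dots are assumed Poisson with density 1 / surround_spacing**2.
    """
    density = 1.0 / surround_spacing ** 2
    return poisson_tail(n_dots, density * length * width)


# A 200 x 4 pixel strip holding 11 dots against a surround with spacing 20.
p = line_significance(11, length=200, width=4, surround_spacing=20)
```

A small value of `p` marks the strip as unlikely to be accidental; raising the surround density raises `p` for the same strip, in line with the dependence on context described above.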

1.2  Preview 

First,  I  discuss  human  perception  of  dotted  lines  and 
dotted  curves  in  images.  I  review  the  results  of  my 
previous  psychophysical  experiments  that  determined 
the  power  and  limitations  of  human  preattentive  vision 
in  this  task,  and  present  some  new  results  that  are  more 
amenable  to  analysis. 

Second,  I  present  a  statistical  model  for  what  infor¬ 
mation  is  being  computed  from  the  image,  make  pre¬ 
dictions  about  the  limits  of  performance  based  on  that 
model, and compare the predictions with the experimental
data. I show that curves can be detected by line
detectors,  and  that  the  characteristics  of  a  mechanism 
that  uses  this  approach  are  similar  to  those  observed  in 
humans.  I  discuss  computer  simulations,  in  which  three 
types  of  line-detection  operators  are  applied  to  stimuli 
similar  to  those  used  in  the  human  experiments;  the 
performance  of  these  operators  in  the  line  detection 
task  is  presented  and  compared  with  human  perfor¬ 
mance.  I  conclude  by  reviewing  related  research  and 
discussing  the  implications  of  the  results  presented  here 
for  human  and  computer  vision. 

2  Experimental  Methods 

This  section  introduces  two  experiments  that  deter¬ 
mined  the  parameters  affecting  the  detectability  of  dot¬ 
ted  lines  and  dotted  curves.  Experiment  I  studied  dot¬ 
ted  straight  lines,  Experiment  II  dotted  curves.  These 
target  patterns  were  displayed  to  subjects  for  a  short 
time  (about  200  msec),  preventing  prolonged  inspec¬ 
tion  of  the  image  and  forcing  subjects  to  use  preatten¬ 
tive  mechanisms. 

Experiment  I  (lines)  determined  how  close  the  dots 
in  the  line  need  to  be,  compared  to  those  in  the  sur¬ 
round,  in  order  for  the  target  to  be  detected.  We  expect 
that  targets  with  closely-spaced  dots  will  be  easier  to 
detect  than  those  with  dots  placed  farther  apart;  the 
experiments quantified this effect. Another parameter
of interest is the amount of transverse variation, or
point  scatter,  in  the  line.  We  expect  that,  for  fixed  dot 
separation  in  the  target,  as  the  transverse  variation  in¬ 
creases,  the  line  will  be  more  difficult  to  detect;  again, 
the  experiments  quantify  this  effect. 

I  noticed  that  dotted  curves  are  more  difficult  to  de¬ 
tect  than  dotted  lines,  given  equal  dot  separations  in 
the  target.  We  expect  that  as  the  curvature  of  a  dotted 
arc  increases,  the  arc  will  become  more  difficult  to  de¬ 
tect.  Experiment  II  quantified  curvature  detectability 
limits  in  an  arc  as  the  dot  separation  varied. 

The  subject's  task  was  to  indicate  where  he  or  she 
thought  the  target  had  appeared,  in  a  four-alternative 
forced-choice  task.  The  target  was  located  in  one  of 
four  locations  in  the  image:  vertical  on  the  left  or  right, 
or  horizontal  on  the  top  or  bottom.  Some  examples  of 
the  kind  of  stimuli  used  appear  in  Fig.  2.  Detectability 
was  measured  by  accuracy  of  location  of  the  target  over 
50  trials. 

2.1  Subjects  and  Apparatus 

Two  graduate  students  served  as  subjects.  A  Symbol¬ 
ics  3600  Lisp  Machine  with  a  black-and-white  monitor 
generated  and  displayed  the  images.  The  distance  from 




Figure 2. Examples of stimuli used in the experiments.
The actual experiments contained only one target; four are
shown here. Parameters: $d_s = 20$, lengths $L$ are all 200, and
$v_L = 0$. The target at top is emphasized, and has $d_t = 18$,
$v_T = 0$, $H = 0$; the one at bottom is the same, except $H = 40$.
At left, a line with $H = 0$, $d_t = 18$, $v_T = 0.4$; at right, the
same except $d_t = 15$. The actual size of the square image
was 12.7 cm.

screen  to  subject  was  35.9  cm.  The  size  of  the  square 
display  was  460  pixels  (12.7  cm),  subtending  13°  of 
visual  angle.  The  length  of  the  targets  was  200  pixels 
(11.0  cm),  subtending  5.7°.  Each  dot  was  a  2  x  2  square 
of  pixels,  the  side  subtending  3.4  minutes  of  arc.  The 
distance  from  the  fixation  point  at  the  center  of  the 
display  to  the  line  was  6.0  cm  or  3.1°. 

2.2  Stimuli 

The  target  dots  are  embedded  in  a  surround  of 
randomly-placed  dots  whose  appearance  is  described 
by  one  parameter,  the  average  spacing  between  dots 
$d_s$ (distance in surround). The mean number of dots
falling in the image is $A\rho$, where $A$ is the area of the
image and $\rho = 1/d_s^2$ is the dot density. An approximation
to a Poisson distribution of surround dots is
achieved by placing a number of dots at random in the
image; the number of dots is normally distributed with
mean and variance $A\rho$, as in a Poisson distribution.
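
A minimal sketch of this surround-generation procedure follows. The rounding of the normally distributed count to an integer and the clamping at zero are assumptions, as are the function and parameter names.

```python
import random


def make_surround(size, spacing, rng=random):
    """Scatter background dots over a size x size image.

    The dot count is drawn from a normal distribution with mean and variance
    A * rho, where A = size**2 and rho = 1 / spacing**2, approximating a
    Poisson count; positions are uniform over the image.
    """
    mean = size * size / spacing ** 2
    count = max(0, round(rng.gauss(mean, mean ** 0.5)))
    return [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(count)]


# A 460-pixel display with average surround spacing 20 (mean count 529).
dots = make_surround(size=460, spacing=20, rng=random.Random(0))
```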

Figure 3. A dotted circular arc. Chord length is $L$ and
height is $H$.

Lines

The dotted-line generating procedure uses the length $L$
of the line and the dot spacing $d_t$ (distance in target)
to produce a line containing $[L/d_t]$ dots. A longitudinal
variation $v_L$ specifies the maximum fractional variation
along the line. Dots are perturbed about their
nominal (evenly-spaced) positions by $\frac{1}{2} X d_t v_L$ in either
direction, where $X$ is a uniformly-distributed random
number between -1 and 1. Similarly, transverse variation
$v_T$ specifies the maximum variation in a direction
normal to the line. Dots are perturbed about their
nominal positions by an amount $\frac{1}{2} X d_t v_T$; $X$ is again
a random number between -1 and 1. When $v_L = 0$,
the dots are evenly-spaced; when $v_T = 0$, the dots fall
in a straight line. The dots generated in this manner
fall in a rectangular region of length $L$ and width $d_t v_T$.
The parameters $v_T$ and $v_L$ provide a scale-independent
way of specifying the shape of the line; magnifying or
shrinking the pattern does not change $v_T$ or $v_L$.
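
The line-generating procedure can be sketched as follows. The sketch reads the bracketed dot count as a floor, reads the garbled perturbation factor as one half (consistent with the stated region width), and places the line along the x-axis from the origin; these readings and all names are assumptions.

```python
import random


def make_dotted_line(length, d_t, v_l=0.0, v_t=0.0, rng=random):
    """Generate a horizontal dotted line starting at the origin.

    Dots start at multiples of d_t and are displaced by 0.5 * X * d_t * v_l
    along the line and 0.5 * X * d_t * v_t across it, X uniform in [-1, 1].
    """
    n_dots = int(length // d_t)  # assumes [L / d_t] means the floor
    dots = []
    for i in range(n_dots):
        x = i * d_t + 0.5 * rng.uniform(-1, 1) * d_t * v_l
        y = 0.5 * rng.uniform(-1, 1) * d_t * v_t
        dots.append((x, y))
    return dots


# Length 200, target spacing 18, no longitudinal and 0.4 transverse variation.
line = make_dotted_line(200, 18, v_l=0.0, v_t=0.4, rng=random.Random(1))
```

With these settings every dot lies within 3.6 pixels of the axis, so the dots occupy a strip of width 18 × 0.4 = 7.2 pixels.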

Curves 

The  shape  of  an  arc  of  evenly-spaced  dots  (see  Fig.  3) 
can  be  specified  by  the  length  L  of  the  chord  joining 
the  ends  of  the  arc,  the  height  H  of  the  arc  above  this 
chord,  and  the  average  spacing  dt  of  dots  in  the  arc. 
In  these  experiments,  the  arcs  had  no  transverse  or 
longitudinal variation. When $H$ is zero, an arc specified
in this manner reduces to a straight line as specified
above. The radius of curvature of such an arc is given by

$$R = \frac{L^2}{8H} + \frac{H}{2}, \qquad (1)$$

and the curvature $\kappa$ of the arc is just $1/R$.
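
As a worked illustration, the sketch below computes the radius from the chord length and height using the standard chord-height relation R = L²/(8H) + H/2, and places dots at even intervals along the arc. The dot spacing is rounded so that dots fall exactly on both endpoints, which makes it only approximately equal to the requested spacing; that choice, and the function names, are assumptions.

```python
import math


def arc_radius(chord, height):
    """Radius of a circular arc with the given chord length and height."""
    if height == 0:
        return math.inf  # a straight line has zero curvature
    return chord ** 2 / (8.0 * height) + height / 2.0


def make_dotted_arc(chord, height, d_t):
    """Evenly spaced dots on an arc whose chord runs from (0, 0) to (chord, 0).

    Assumes 0 < height <= chord / 2, i.e. at most a semicircle.
    """
    if not 0 < height <= chord / 2.0:
        raise ValueError("height must be in (0, chord / 2]")
    r = arc_radius(chord, height)
    half_angle = math.asin(min(1.0, chord / (2.0 * r)))
    n_gaps = max(1, int(2.0 * half_angle * r // d_t))
    center_y = height - r  # centre lies below the chord's midpoint
    dots = []
    for i in range(n_gaps + 1):
        a = -half_angle + 2.0 * half_angle * i / n_gaps
        dots.append((chord / 2.0 + r * math.sin(a), center_y + r * math.cos(a)))
    return dots


r = arc_radius(200, 40)  # chord 200, height 40
kappa = 1.0 / r
arc = make_dotted_arc(200, 40, 18)
```

For a chord of 200 and a height of 40 the radius is 145, so the curvature is about 0.0069 per pixel.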

2.3  Analysis  of  data 

An  experiment  typically  consisted  of  varying  the  value 
of one parameter, say $v_T$, while keeping all others
fixed.  As  the  parameter  varied,  accuracy  of  detec¬ 
tion  changed,  usually  deteriorating  and  approaching 
the  chance  level  of  25%  for  the  4-alternative  task.  A 
psychometric  function  given  by 

$$P(x) = G + (\gamma - G)\exp\left[-(x/\alpha)^{\beta}\right], \qquad (2)$$




where  x  is  the  independent  variable,  was  fit  to  each 
such  set  of  data  points  using  a  method  described  by 
Watson  [16].  G  is  the  expected  fraction  correct  when 
the  subject  guesses;  here,  G  =  0.25. 
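
The sketch below evaluates a function of this form and solves it for the value of x at which accuracy falls to a chosen level. It reads equation (2) as a Weibull-type function with scale α, steepness β, and upper asymptote γ; that reading of the symbols, the parameter values, and the function names are assumptions made for illustration.

```python
import math


def psychometric(x, alpha, beta, gamma, guess=0.25):
    """P(x) = G + (gamma - G) * exp(-(x / alpha) ** beta)."""
    return guess + (gamma - guess) * math.exp(-((x / alpha) ** beta))


def threshold(level, alpha, beta, gamma, guess=0.25):
    """Value of x at which P(x) falls to the accuracy `level`."""
    if not guess < level < gamma:
        raise ValueError("level must lie strictly between guess and gamma")
    return alpha * math.log((gamma - guess) / (level - guess)) ** (1.0 / beta)


# Hypothetical fitted parameters, chosen only to exercise the functions.
alpha, beta, gamma = 0.5, 3.0, 0.98
x80 = threshold(0.80, alpha, beta, gamma)
```

Accuracy starts at γ for x = 0 and decays toward the guessing rate G as x grows, so the threshold is well defined for any level strictly between the two.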

We  are  interested  in  the  80%  accuracy  threshold  of 
performance.  From  (2)  we  obtain: 


x  = 


a 


log 


7  -C 
T  -  G 


U/0) 


of  transverse  variation  v p  that  can  be  tolerated  in  a 
straight  line  is  independent  of  scale.  That  is,  maximum 
t’T  depends  on  the  relative  separation  of  target  dots 
compared  with  surround  dots,  d,/ds.  Similar  results 
hold  for  curve  detection:  the  amount  of  curvature  that 
can  be  tolerated  is  largely  independent  of  the  absolute 
density  of  the  patterns,  but  depends  mainly  on  d,/d,. 

3.2  Experiment  I 


with  T  ~  0.80. 

It is difficult to obtain error estimates on this threshold.
However, an approximate standard deviation was
obtained by using a "bootstrap" method, as follows.
The fitting procedure described by Watson [16] assumes
that the data points $(x, P(x))$ are Bernoulli random
variables with some underlying probability $p(x)$
for the $n$ trials. We use this underlying distribution
function for each data point to generate new sets
of data points; each new data point comes from the
Bernoulli distribution corresponding to $x$. Since $p(x)$
and $n$ completely specify the Bernoulli distribution,
we find the standard deviation of observed number
correct as $\sigma'(x) = \sqrt{n\,p(x)[1 - p(x)]}$. The standard
deviation of fraction correct for each point is then
$\sigma(x) = \sigma'(x)/n = \sqrt{p(x)[1 - p(x)]/n}$. A threshold
value is computed for this new data set. This operation
is performed several times, and the standard deviation
of these thresholds is used as the standard deviation of
the actual threshold.
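The bootstrap procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the coarse grid-search maximum-likelihood fit is only a stand-in for the method of Watson [16], the psychometric form and `GAMMA = 1.0` are assumed, and all function names are hypothetical.

```python
import math
import random
import statistics

G, GAMMA, T = 0.25, 1.0, 0.80  # GAMMA = 1.0 is an assumption

def p_correct(x, alpha, beta):
    """Assumed psychometric form: G + (GAMMA - G) exp[-(x/alpha)^beta]."""
    return G + (GAMMA - G) * math.exp(-((x / alpha) ** beta))

def threshold(alpha, beta):
    """Stimulus level at which predicted accuracy equals T."""
    return alpha * math.log((GAMMA - G) / (T - G)) ** (1.0 / beta)

def fit(xs, fractions, n):
    """Coarse grid-search maximum-likelihood fit (stand-in for Watson's method)."""
    best, best_ll = None, -math.inf
    for alpha in [0.1 * i for i in range(1, 61)]:
        for beta in [0.5 * j for j in range(1, 13)]:
            ll = 0.0
            for x, f in zip(xs, fractions):
                p = min(max(p_correct(x, alpha, beta), 1e-9), 1 - 1e-9)
                k = f * n  # number correct out of n trials
                ll += k * math.log(p) + (n - k) * math.log(1 - p)
            if ll > best_ll:
                best, best_ll = (alpha, beta), ll
    return best

def bootstrap_threshold_sd(xs, fractions, n, n_boot=50, seed=0):
    """Return (fitted threshold, bootstrap standard deviation of the threshold)."""
    rng = random.Random(seed)
    alpha, beta = fit(xs, fractions, n)
    thresholds = []
    for _ in range(n_boot):
        resampled = []
        for x in xs:
            p = p_correct(x, alpha, beta)
            # n simulated Bernoulli trials at the fitted probability p(x)
            correct = sum(rng.random() < p for _ in range(n))
            resampled.append(correct / n)
        thresholds.append(threshold(*fit(xs, resampled, n)))
    return threshold(alpha, beta), statistics.stdev(thresholds)
```

The grid resolution limits how finely the threshold can be resolved; a real analysis would replace `fit` with a proper optimizer.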

3  Experimental  Results 

3.1  Previous  results 


Transverse variation $v_T$ was the parameter of interest
for straight lines. I measured how much variation could
be tolerated for a given $d_t$ and $d_s$, using a performance
standard of 80% correct as the criterion for detectability.
The results are shown in Fig. 4a. As expected,
more transverse variation can be tolerated when $d_t/d_s$
is small.

It is important to ask if these results are independent
of the size and density of the image, i.e., if the
mechanism is scale invariant. If it were, then since $v_T$
is a scale-independent measure of transverse variation,
the curves in Fig. 4a would coincide when plotted as a
function of relative dot separation $d_t/d_s$, as in Fig. 4b.
Those curves have similar shape but do not coincide.
Thus, dotted lines in dense patterns are somewhat more
easily detected than in sparse patterns, other factors
being equal. To a large extent, however, the important
parameter for this range of dot densities is the relative
separation of target dots compared to surround dots.
Note that for nearly straight lines ($v_T \approx 0$) the dotted
line can be detected essentially when the target dots
are closer together than the surround dots. These results
apply over a range of a factor of 2.2 in $d_s$, or a
factor of 4.8 in dot density.


A previous series of experiments, reported by Vistnes
[15], used similar stimuli except that the random surrounds
were more uniform in dot density. This was
done by randomly perturbing the positions of a regular
grid of dots. Also, dots in the surround that fell near
dots in the target were erased. The dot patterns produced
in this manner have no spurious clusters of dots
but still appear quite random. However, the properties
of such distributions are more difficult to analyze
than those of a purely random (Poisson) distribution.
Hence, I repeated the experiments using a purely random
background.

The earlier results showed that longitudinal variation
$v_L$ had a very small effect on detectability. That is,
the dots in the target did not need to be evenly spaced
along the line. Target length $L$ was also unimportant,
so long as there were at least five or six dots in the line;
adding dots to the line by increasing its length did not
increase its detectability. To a large extent the amount


3.3  Experiment  II 

For curved targets, the important parameter is arc
height $H$, or equivalently, curvature $\kappa$. The experiments
measured the maximum curvature that could
be detected with a fixed $d_t$ and $d_s$. The results are
shown in Fig. 5a. The maximum detectable curvature
increases as $d_t/d_s$ decreases.

Again, we inquire if these results are independent of
scale. Fig. 5b shows the results as a function of $d_t/d_s$.
The curves have similar shape (this time nearly linear
with similar slopes), but there is again a tendency for
curvature in dense patterns to be more easily detected
than in sparse ones.

4  A  Model  for  Dotted  Line  Detection 

I  now  describe  a  statistical  model  that  accounts  in 
a  qualitative  way  for  these  experimental  data.  The 




Figure 4. (a) Experimental maximum transverse variation $v_T$ as a function of target dot separation $d_t$. Parameters are:
$L = 200$, $v_L = 0$; $d_s$ ranges from 10 to 22. Error bars represent one standard deviation. (b) The same curves plotted as a
function of relative target dot separation $d_t/d_s$.




Figure 5. (a) Experimental maximum arc curvature $\kappa$ as a function of target dot separation $d_t$. Plotted curvature is 1000
times actual curvature, for ease of reading. Arc length $L = 200$. Error bars are one standard deviation. (b) The same curves
as a function of relative dot separation $d_t/d_s$. Error bars are omitted for clarity.




model is based on comparing the dot density in a central
target region with that in a surrounding region.
The model uses operators that compute the probability
of the number of dots in the center region, based on
an estimate of the density of dots in the surrounding
region. The smaller this probability, the less likely that
event is to have occurred at random, i.e., the more
likely it is to be non-accidental in origin.

The foundations of this approach have been laid by
Lowe [8,10], Binford [3], and Witkin and Tenenbaum
[18]. They claim that the goal of perceptual organization
is to find non-accidental structures in images.
Lowe suggests that perceptual organization be viewed
as a process which assigns a degree of significance to
various groupings of image features. Lowe's and Witkin
and Tenenbaum's notion of non-accidentalness involves
estimating the prior probability of occurrence of a pattern
in the image, and measuring the accuracy with
which a relation (e.g., collinearity) holds. The current
approach is similar in spirit, but makes no use of prior
probabilities as theirs does.

For the model to have relevance to real images, we
must make a broadly applicable assumption about the
distribution of dots in the image. One general assumption
is that the dots are placed independently and uniformly
at random; this dot distribution is well modeled by a
Poisson random distribution.

In detail, consider a rectangular test region in the
image, of length $L$ and width $w$, and a surround region
surrounding it. Suppose there are $n_c$ dots in the
central test region. We form the null hypothesis $H_0$:
the dots in the center region and in the surround come
from the same distribution, and test it against the hypothesis
$H_1$: the dots in the center region come from a
distribution with greater mean. From the surround we
estimate the surround dot density $\rho$. We can calculate
the probability $P_{\mathrm{random}}(n_c)$ of $n_c$ dots by the Poisson
probability

$$P_{\mathrm{random}}(n_c) = \frac{\exp(-\rho w L)\,(\rho w L)^{n_c}}{n_c!}.$$

The expected number of dots in the region is $\rho w L$;
as $n_c$ increases above this mean value, the probability
that this event occurred at random decreases, and the
probability it was non-accidental increases. We can
plot the probability $P_{\mathrm{random}}(n_c)$ as a function of $n_c$, for
given $\rho$, $w$ and $L$. If we keep $L$ constant and increase
$w$, the mean number of dots in the region increases, as
does the variance of the distribution. Some members
of this family of curves are shown in Fig. 6.

Suppose we choose a criterion of $\alpha$ (say, 0.001) for
statistical significance. For any width $w$, we can find
an $n_w$ such that when $n_c > n_w$, $P_{\mathrm{random}}(n_c) < \alpha$. As


Figure 6. $P_{\mathrm{random}}(n_c)$ as a function of $n_c$ for several widths
$w$. The widths range from 8 to 20; as the width increases,
the curve shifts to the right and widens. Other parameters
are: $\rho = 1/20^2 = 0.0025$ (corresponding to $d_s = 20$) and $L = 200$.
Note that the required $n_w$ for which $P_{\mathrm{random}}(n_w) < \alpha$
increases as $w$ increases.

$w$ increases, so does the $n_w$ for which statistical significance
is reached: that is, more dots are required to
maintain the same level of significance.
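The relationship between region width and the required dot count can be illustrated with a short Python sketch. It is a hedged example, not the procedure used to produce Fig. 6: the function names are hypothetical, and the convention adopted for the required count (the smallest count at or above the mean whose Poisson probability falls below the criterion) is an assumption of the example.

```python
import math

def p_random(n_c, rho, w, length):
    """Poisson probability of exactly n_c dots in a w-by-length region."""
    mean = rho * w * length
    # Work in log space so large counts do not overflow the factorial.
    return math.exp(-mean + n_c * math.log(mean) - math.lgamma(n_c + 1))

def required_count(rho, w, length, alpha=0.001):
    """Smallest count at or above the mean with p_random < alpha (assumed convention)."""
    n = math.ceil(rho * w * length)
    while p_random(n, rho, w, length) >= alpha:
        n += 1
    return n

if __name__ == "__main__":
    rho, length = 1.0 / 20 ** 2, 200  # surround spacing d_s = 20, as in Fig. 6
    for w in (8, 12, 16, 20):
        print(w, required_count(rho, w, length))
```

Running the loop shows the required count growing with the width, which is the trend the text describes.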

5  Predictions  of  the  Model 

Here I show how we can use the model to make predictions
about line and curve detection.

5.1  Lines 

Our goal is to find how the maximum $v_T$ varies with
target dot separation $d_t$. Suppose that some of the
dots in the test region were generated by a process like
the one used in the experiments described above: they
are evenly spaced in a line of length $L$, separated by
$d_t$, and have a transverse variation $v_T$. The width of
such a target is $d_t v_T$, and it contains $\lceil L/d_t \rceil$ dots. In
order for all the target dots to fall in the test region,
we require $w \ge d_t v_T$ or $v_T \le w/d_t$; this provides the
maximum $v_T$ for these conditions.

Some dots from the surround will fall in the test region
as well; the expected number of such dots is $\rho w L$.
If there are $n_w$ dots in the test region, then the number
of dots from the target is $m = n_w - \rho w L$. We
then calculate $d_t = L/m$ and hence the maximum $v_T$
is $w/d_t = wm/L$. Thus, if we know the number of dots
$n_w$ required to fall in the region for significance, we
can find $d_t$ and the maximum $v_T$ for the corresponding
dotted target line.

If we consider a number of widths $w$ and fixed $L$,
for a fixed surround density $\rho = 1/d_s^2$, we can find $d_t$
and maximum $v_T$ for each of these widths. Collecting
these results, we obtain a curve of maximum $v_T$
versus $d_t$ just as in the human experiments. We can
compare these predictions with experimental results. I
plot maximum variation $v_T$ versus $d_t$ in Fig. 7a; the
several curves correspond to several different surround
densities. Fig. 7b shows the same results as a function
of relative dot spacing $d_t/d_s$. When plotted in
this manner, the curves exhibit the same shape but do
not coincide. The prediction is that lines in dense surrounds
should be somewhat easier to detect than lines
in sparse backgrounds. Comparing Fig. 7 with Fig. 4,
we see that this prediction matches experimental results,
but that this effect is less pronounced in human
vision.
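The chain of steps in this section (required count, number of target dots, target spacing, maximum transverse variation) can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, and the rule for the required count (smallest count at or above the mean whose Poisson probability is below the criterion) is an assumption of the example.

```python
import math

def required_count(mean, alpha):
    """Smallest count at or above `mean` whose Poisson probability is < alpha."""
    n = math.ceil(mean)
    while math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1)) >= alpha:
        n += 1
    return n

def predict_line(w, d_s, length=200.0, alpha=0.0001):
    """Return (d_t, max v_T) predicted for a test region of width w."""
    rho = 1.0 / d_s ** 2          # surround density
    mean = rho * w * length       # expected surround dots in the region
    n_w = required_count(mean, alpha)
    m = n_w - mean                # dots that must come from the target
    d_t = length / m              # target dot separation
    return d_t, w / d_t           # maximum v_T = w / d_t = w * m / length

if __name__ == "__main__":
    for w in (4, 8, 12, 16, 20):
        print(w, predict_line(w, d_s=20))
```

Sweeping `w` for a fixed `d_s` traces out one predicted curve of maximum $v_T$ against $d_t$; repeating for several values of `d_s` gives a family of curves of the kind described for Fig. 7.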


5.2  Curves 

First, note that rectangular test regions can be used to
detect dotted curves. When the arc curvature is low, a
narrow region suffices to detect the arc; as the curvature
increases, an increasingly wide region is necessary.
As the region widens, more dots are required to fall in
it in order to maintain the same level of significance,
so the dots in the arc must be closer together. To calculate
the maximum curvature for arcs, I use the same
sort of analysis as applied above to line detection.

Consider a circular arc passing through a region of
length $L$ and width $w$, as depicted in Fig. 8. The
arc of maximum curvature is obtained when the arc
is situated as shown. The radius of the arc can be
computed by analogy to (1): $R = [(L/2)^2 + w^2]/(2w)$,
and the curvature is $\kappa = 1/R$. The angle $\theta$ is
$\theta = \tan^{-1}[(L/2)/(R - w)]$.

As before, we determine the number of dots $n_w$
needed to fall in the region in order to ensure a statistical
significance criterion of $\alpha$. For significance we
need $m = n_w - \rho w L$ target dots, since an expected $\rho w L$
dots from the surround fall in the region as well. The
dot separation $d_t$ can then be computed as $d_t \approx 2R\theta/m$
(assuming $R \gg d_t$).

Thus, given a test region of size $L \times w$ and a surround
density $\rho = 1/d_s^2$, we can calculate the dot spacing $d_t$
and maximum curvature $\kappa$. Varying $w$ yields a number
of predicted data points; the results for several values
of $d_s$ are shown in Fig. 9a. As expected, the predicted
maximum curvature increases as the dots in the arc fall
closer together. Fig. 9b shows the predicted maximum
curvature versus dot separation ratio $d_t/d_s$. Note that
the curves are nearly coincident, but again there is a
tendency for targets in dense patterns to be more easily
detected than in sparse backgrounds. Human experiments
show the same effect; compare Fig. 5.
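The curve prediction can be sketched in the same way as the line prediction. The Python below is a hedged illustration: the function names are hypothetical, the required-count rule (smallest count at or above the mean with Poisson probability below the criterion) is an assumption, and `math.atan2` is used in place of the arctangent of a ratio purely for numerical safety.

```python
import math

def required_count(mean, alpha):
    """Smallest count at or above `mean` whose Poisson probability is < alpha."""
    n = math.ceil(mean)
    while math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1)) >= alpha:
        n += 1
    return n

def predict_curve(w, d_s, length=200.0, alpha=0.0001):
    """Return (d_t, max curvature) for a test region of size length x w."""
    radius = ((length / 2) ** 2 + w ** 2) / (2 * w)
    kappa = 1.0 / radius
    theta = math.atan2(length / 2, radius - w)   # half-angle subtended by the arc
    mean = w * length / d_s ** 2                 # expected surround dots, rho*w*L
    m = required_count(mean, alpha) - mean       # target dots needed
    d_t = 2 * radius * theta / m                 # arc length divided by dot count
    return d_t, kappa

if __name__ == "__main__":
    for w in (4, 8, 12, 16, 20):
        d_t, kappa = predict_curve(w, d_s=20)
        print(w, round(d_t, 2), round(1000 * kappa, 3))
```

The example prints 1000 times the curvature, matching the scaling used in the figure captions.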


Figure 8. A dotted circular arc passing through a region
of length $L$ and width $w$. The arc radius is $R$ and the angle
between two dots in the arc is $\phi$.


Figure 10. Geometry of an operator of the type used in
the simulations.

6  Computer  Simulations 

To determine how well operators such as those proposed
here would actually perform in a line-detection
task, I performed computer simulations in which dotted
line stimuli similar to those presented to human
subjects were presented to a simulation system. The
simulation system used operators of various sizes to try
to detect the target. This section describes the simulations
and their results. The details of the simulation
setup appear in Appendix A.

The simulations are based on the assumption that
the human visual system contains line-detection operators
of various lengths and widths at each position
in the visual field. As is commonly done in signal detection
theory [6], I model the forced-choice detection
task as follows. The program calculates the output of
each operator at each position, and the response of the
system for one trial corresponds to the position of that
operator with the largest output. The program records
detection accuracy; as $v_T$ increases, performance deteriorates,
as expected.

Each operator in the simulation (see Fig. 10) counts
the number of dots in its center region and in its surround,
$n_c$ and $n_s$, and estimates the surround density






Figure 7. Predicted maximum transverse variation $v_T$ as a function of target dot separation $d_t$. (a) The results for several
surround densities, from left to right: $d_s = 10, 15, 20, 25$; $\rho = 1/d_s^2$. Probability criterion was $\alpha = 0.0001$, $L = 200$. (b) The
same results, plotted as a function of $d_t/d_s$.




Figure 9. Predicted maximum curvature as a function of target dot separation. Plotted curvature is 1000 times actual
curvature, for ease of reading. $L = 200$, $\alpha = 0.0001$. (a) Results for several surround densities are shown. From left to right,
$d_s = 10, 15, 20, 25$. (b) The same results are plotted as a function of relative dot separation $d_t/d_s$.




$\rho = n_s/(2 w_s L)$. It then calculates the probability of
those $n_c$ center dots, assuming a Poisson distribution
for the surround, using

$$P_{\mathrm{random}}(n_c; n_s) = \frac{\exp(-2\rho w_c L)\,(2\rho w_c L)^{n_c}}{n_c!}.$$

Now

$$2\rho w_c L = 2\,\frac{n_s}{2 w_s L}\,w_c L = n_s \frac{w_c}{w_s},$$

so that

$$P_{\mathrm{random}}(n_c; n_s) = \frac{\exp(-r n_s)\,(r n_s)^{n_c}}{n_c!}, \tag{3}$$

with width ratio $r = w_c/w_s$. The operators in the simulations
compute the negative logarithm of (3). Since
the simulations use only the operator with maximum
response, and this is a monotonic transformation, it has
no effect on the results.
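A minimal Python sketch of such an operator follows. It assumes that (3) is the Poisson probability with mean $r n_s$, as derived from the surround density estimate; the function names, the handling of an empty surround, and the nested-list layout of the responses are all assumptions made for the example.

```python
import math

def operator_response(n_c, n_s, r):
    """Negative log of the Poisson probability of n_c center dots.

    The Poisson mean r * n_s is estimated from the surround count n_s
    and the width ratio r = w_c / w_s.
    """
    mean = r * n_s
    if mean == 0.0:
        # Empty surround: any center dot is maximally surprising (assumed convention).
        return 0.0 if n_c == 0 else math.inf
    return mean - n_c * math.log(mean) + math.lgamma(n_c + 1)

def choose_position(responses):
    """Forced choice: index of the position holding the largest operator output.

    `responses[i]` is the list of outputs of all operators at position i.
    """
    best_per_position = [max(outputs) for outputs in responses]
    return best_per_position.index(max(best_per_position))

if __name__ == "__main__":
    r = 1.0 / 3.0  # surround three times as wide as the center, as in Appendix A
    responses = [[operator_response(4, 12, r)],
                 [operator_response(10, 12, r)],
                 [operator_response(3, 12, r)],
                 [operator_response(5, 12, r)]]
    print(choose_position(responses))
```

Because the negative logarithm is monotonic in the probability, picking the largest response is equivalent to picking the least probable count.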

I collected the simulation results for several values
of $d_s$, as in the human experiments, and used the 80%
threshold to find the maximum $v_T$ for each value of $d_s$
and $d_t$. These results are shown in Fig. 11. Note that
the results are not scale invariant: targets in dense dot
patterns (e.g., $d_s = 8$) are more easily detected than
those in sparse patterns (e.g., $d_s = 16$). This effect
seems somewhat more pronounced than in the human
results, however; compare Fig. 4b.


6.1  Comparison  with  other  operators 

It is of interest to compare the results obtained
with these significance-estimating operators with other
kinds of operators. Difference-of-Gaussians (DOG) operators
are commonly used in computer vision programs
for edge detection, and Zucker [20,22] has used
them for orientation selection. For computational efficiency,
the difference-of-Gaussians is sometimes approximated
by a difference-of-boxes (DOB) operator
[3]. The output of these operators is their convolution
with the image; they are linear filters that compute
the difference of two convolutions. The output of the
significance operator cannot be expressed as a convolution.
The computational details of the DOG and DOB
operators are relegated to Appendix B.

I ran some additional simulations with the difference-of-Gaussians
and difference-of-boxes operators. Each of
the three types of operator was evaluated on the same
dot pattern; each operator type had the same variety
of lengths and widths at four positions. Accuracy of
detection was recorded as before. Some typical results
are shown in Fig. 12; these are raw data showing accuracy
versus $v_T$. Results for other stimulus parameters


Figure 12. Results of the three kinds of operators on the
same set of stimuli. Parameters for the stimuli were: $d_s = d_t = 12$,
$L = 100$. Number of trials for each data point is
100. Curves for other parameter values show similar trends.


are similar. Results for the three operators are generally
quite similar, but the significance operator performs
somewhat better than the two convolution operators:
the former is better able to detect linear patterns.
Comparisons of maximum $v_T$ thresholds show
that all operators have the same qualitative characteristics
(see Fig. 11). All the operators can detect lines in
dense dot patterns more easily than in sparse patterns.
The outputs of the DOG and DOB can be viewed as
an approximation, perhaps, to a calculation based on
the significance of the target line.

7  Discussion 

The results presented here indicate that the statistical
model is consistent with human performance on
the preattentive tasks considered in my experiments.
That does not imply, of course, that the human visual
system uses this mechanism, or even if it does, that
it uses no other mechanisms. I will be cautious and
merely suggest that in this forced-choice task, under
these impoverished conditions, a computation like the
one suggested here may be in use.

The results also support the hypothesis that humans
use elongated operators to detect curves, at least preattentively.
However, more sophisticated experiments
than those reported here would be needed in order to
confirm or disprove that hypothesis. Other research
[5,12,17] provides further evidence for this hypothesis.

The model presented here does not explain why a
dotted line with many dots is no easier to detect than
one with just five or six; Uttal [14] notes the same phenomenon.
The model does explain why longitudinal
variation $v_L$ has little effect on detectability.




Figure 11. (a) Simulation results for linear targets. Curves
correspond to constant values of $d_s$. (b) The same results
plotted as a function of $d_t/d_s$.


Regardless of the psychological validity of this model,
stronger claims can be made about its relevance to
computer vision. Computing the probability of non-accidentalness
of linear patterns is a powerful way of
detecting them in an image. This method approaches
the problem of how to determine when a pattern is a
line and when it is not: it computes the statistical significance
of a grouping relative to the local background.
For straight lines with little noise in the surround, the
significance of the grouping will be very high. For
marginal lines such as the one depicted at the bottom of
Fig. 1, the significance will be much lower. Depending
on the goals of the vision system, it may be useful to
look only for highly significant groupings, or it may be
necessary to look for even marginally significant ones.
The mechanism proposed here allows us to do either.

The reader should note that the simple predictions
based on this statistical model (e.g., Figs. 7 and 9)
fail to consider a number of factors. The predictions
do not consider the number of alternatives for target
location, nor do they include any notion of choosing between
alternatives. A realistic system for line detection
needs to measure the significance of line-like structures
at several positions in the image, at several lengths and
widths, and choose the position corresponding to the
'best' one as its choice in the forced-choice experiment.
The predictions of this model indicate only the qualitative
characteristics of a system using the significance-estimating
approach.

Although the simulations do consider these factors,
they fail to consider other potentially important ones.
First, the human visual system does not know precisely
the position of the target lines. The simulations assume
that the positions of the targets are known with no uncertainty,
and that other operators that may be nearby
are ignored. Including the effects of such nearby operators
may alter the results of the simulations. Second,
the simulation system includes no model of how the
outputs of the operators are used after the initial detection
stage. Any vision system that uses such line
detectors must include some later stage to further process
their output. It is possible that adding this further
stage of processing will result in a system that behaves
more like the human visual system.

One of the contributions of this paper is to provide a
theoretical foundation for finding structures by measuring
their significance with respect to the surround. The
model involves no thresholds, as some other models do.
I have argued that using thresholds in this case is misleading
and not useful; it is more valuable to measure
the probability that the structure arose by accident,
and reason further based on that information. Similar
reasoning can be applied to detection of other features,
like edges; it is the class of pattern being detected and
the statistical significance of that pattern in the image
that matters.

Many questions must be answered before this model
can be embodied in an algorithm. It seems clear that
there will be many detectors with various sizes, orientations,
and widths operating at each image position;
ideally, they would operate in parallel for speed. The
parameters of the detectors need to be chosen. Also,
the outputs of all the operators must be combined in
some coherent fashion. This is simplified since we know
exactly what the operators are computing from the image.
In contrast, some existing algorithms take an ad
hoc approach, with arbitrarily selected operators and
arbitrary parameters, and where the output has no
clear meaning; these linking algorithms and the operators
they use often have little theoretical justification.
The approach suggested here promises to clear up some
of that confusion.

8  Related  Work 

Here I briefly summarize research related to the current
problem. This work is motivated to a large extent
by Lowe's work [8,9,10] on finding groupings in
images. His demonstration [8] that widely-separated
dots cannot easily be seen when surrounded by noise
dots provides a starting point for the psychophysical experiments
reported here. Lowe's work [8,9] on finding
curvilinear structure in images was designed primarily
to find structures in dots that have already been linked
on the basis of proximity, rather than on sparse dotted
lines surrounded by noise.

Lowe [9] claims that one of the goals of perceptual
organization is to statistically distinguish between
accidental and non-accidental instances of a relation.
Witkin and Tenenbaum [18] claim that the non-accidentalness
principle unifies several problems in perceptual
organization, and is in fact the general goal of
image organization. One fruitful assumption is that
the purpose of perceptual organization is to detect stable
image groupings that arise due to structure of the
scene rather than accident. In the context of grouping
dots into lines and curves, discriminating between
accident and non-accident means determining the likelihood
that the structures arose due to random placement
of dots. Structures that are unlikely to be accidental
have a causal origin, i.e., they are the projection
of a corresponding structure in the scene. Lowe notes
that "perceptual organization can be viewed as a process
which assigns a degree of significance to each potential
grouping of image features" [8]. This suggestion
is taken rather literally in this paper, where each operator
computes the statistical significance of the structure
within its receptive field. These points are elaborated
by Lowe [8] and Binford [4,3].

Zucker and his colleagues [19,20,21,22] distinguish
between Type I and Type II processes in dot grouping:
Type I processes infer 1D contours, while Type II
processes deal with 2D flow fields. Although Zucker apparently
does not consider the influence of background
noise on the inference of Type I contours, the problem
considered in this paper falls on the Type I side of his
dichotomy.

Zucker's papers present a model for finding oriented
entities corresponding to Type I or Type II patterns.
The model is suggested by biology: first, convolution of
the image with orientationally selective receptive fields;
and second, interpretation of the outputs of those operators.
The operators in his implementation [22] compute
the difference of Gaussians; they have several arbitrary
parameters, and their output lacks a clear semantic
meaning. In contrast, the output of the operators
proposed here has a clear interpretation. It is difficult
to determine how Zucker's methods would perform on
the kinds of patterns considered here. Zucker [20] proposes
that a vector field of orientations be found in
the image by applying these operators and interpreting
their outputs; from this vector field, he proposes that
Type I contours be found by solving a set of differential
equations. This procedure assumes that there is
only one set of curves passing through the image, that
there is no ambiguity. It is often the case, however,
that there is no perceptually unambiguous way to link
a set of dots (or edgels); there may be many ways to
impose groupings on the input. (How many ways are
there to group the dots in Figs. 1 or 2 into curves?)
A system that measures the significance of different
groupings allows this flexibility, while Zucker's method
does not seem to.

9  Summary  and  Conclusions 

I presented the results of psychophysical experiments
that show the effect of point scatter and curvature on
our ability to detect dotted lines and curves. The main
factor that determines our ability to detect such patterns
is the relative separation of target dots and surround
dots, but there is a significant effect of density:
we can detect curves with more jaggedness and curvature
in dense backgrounds than in sparse.

I presented a statistical model for line and curve detection,
and used the model to make qualitative predictions
about human performance in this task. The
model is based on estimating the density of dots surrounding
an elongated center region, counting the number
of dots in the center, and calculating the probability
that those dots fell in the center region at random.
When this probability is low, we infer that the dots
in the center region have a non-accidental origin, i.e.,
correspond to some structure in the scene. The predictions
based on this model agree, within their limits,
with the empirical results. Predictions for curve detection,
based on detection of lines, agree with data on
human curve detection, supporting a hypothesis that
curves are detected by line detectors.

I discussed computer simulations, in which line-detection
operators of the type proposed here are applied
to a forced-choice line detection task. The results
were discussed and compared to human results.

The model discussed here promises to lay a theoretical
foundation for the detection of lines and curves in
images. In a future paper, I will discuss the implementation
of an algorithm based on this model, and its
application to the task of finding curvilinear structures
in images.


10 Acknowledgements

My advisor T. Binford provided many insights into this
problem, helped with the analysis, and carefully read
a draft. B. Wandell was a great help throughout, especially
with the psychometric analysis, and supplied the
program that fit psychometric functions to the data. A.
Witkin helped crystallize some of the ideas. D. Lowe
provided comments on a draft. E. Isaacs was a willing
experimental subject. This research was supported by
ARPA contract N00039-84-C-0211.


APPENDIX B. Construction of DOG
and DOB operators

The difference-of-Gaussians operator convolves the
function $G(x) = g_c(x) - g_s(x)$ with the dot image; each
dot has the value 1. The functions are defined by

$$g_i(x) = w_i \exp(-k^2 x^2)$$


APPENDIX  A.  Details  of  Simulation 
Setup 


The  simulation  setup  consists  of  a  number  of 
differently-shaped  rectangular  operators  at  each  of  four 
positions  and  orientations  corresponding  to  the  possi¬ 
ble  target  positions.  The  scale  of  this  simulation  is  ap¬ 
proximately  half  that  of  the  human  experiments:  the 
image  is  200  pixels  square,  and  the  target  length  L  is 
100.  At  each  position,  the  operators  come  in  a  variety 
of  lengths  and  widths.  There  are  four  operator  lengths, 
ranging  evenly  from  50  to  170.  For  each  length,  there 
are  four  widths  specified  as  a  fraction  of  the  length; 
these  fractions  range  evenly  from  0.01  to  0.08.  Thus 
there  are  16  operators  at  each  position,  64  altogether. 

The  surround  width  of  each  operator  is  fixed  at  three 
times  the  center  width.  This  is  a  compromise  between 
efficiency  and  accuracy;  a  larger  w,  slows  down  the  sim¬ 
ulations.  Also,  a  larger  surround  increases  the  chance 
of  including  some  qualitatively  different  area  of  the 
background.  There  is  a  tradeoff  between  random  er¬ 
ror  due  to  small  sample  size  (small  w,)  and  systematic 
error  due  to  sampling  the  wrong  region  [3],  A  typical 
operator  of  this  sort  appears  in  Fig.  10. 

It  is  possible  that  a  different  coarseness  of  length  and 
width  resolution  in  the  assortment  of  operators  might 
affect  the  simulation  results.  However,  pilot  experi¬ 
ments  showed  that  four  or  five  different  lengths  and 
widths  are  sufficient  for  full  performance;  using  more 
than  this  does  not  increase  performance  but  merely 
slows  down  the  simulations. 


where 


w,  =  l/\/2*<r;  and  k 2  =  l/2<72. 


The difference-of-boxes operator computes the convolution
$H(x) = h_c(x) - h_s(x)$, where $h_c(x)$ and $h_s(x)$
are defined by

$h_c(x) = 0$ if $|x| > w_c$; $h_c(x) = \alpha$ if $|x| \le w_c$,

$h_s(x) = 0$ if $|x| > w_c + w_s$; $h_s(x) = \beta$ if $|x| \le w_c + w_s$.

For zero mean, we require $\alpha w_c = \beta (w_c + w_s)$.

For  a  reasonable  comparison  of  operator  perfor¬ 
mance,  it  is  necessary  to  match  the  center  and  sur¬ 
round  regions  of  each  operator  as  closely  as  possible. 
For  the  difference-of-boxes  operator  and  the  Poisson 
significance-estimating  operator,  this  is  easy  since  the 
regions  are  strictly  delimited.  For  the  difference-of- 
Gaussians  operator,  however,  the  surround  region  is 
theoretically  the  entire  image;  however,  dots  far  from 
the  center  are  given  very  low  weight  in  calculating  the 
response.  The  DOB  and  DOG  operators  are  differences 
of  weighted  sums  calculated  over  center  and  surround 
convolution  regions.  These  regions  can  be  matched  well 
in  size  by  making  the  standard  deviations  of  their  defin¬ 
ing  functions  equal. 

The variance of a function $f(x)$ on an interval $[a, b]$ is
defined as the mean squared error $\overline{(x - \mu)^2}$, where $\mu$ is
the average value of $x$ in $[a, b]$. To find the variance of
the function $f(x) = \alpha$ on $[-w, w]$, we note that $\mu = 0$
since the interval is centered on zero, so the variance is
$\overline{x^2}$. The average value $\overline{g(x)}$ of a function $g(x)$ on $[a, b]$
is defined as

$\overline{g(x)} = \frac{1}{b-a} \int_a^b g(x)\,dx,$

so the variance of $f$ is

$\frac{1}{2w} \int_{-w}^{w} x^2\,dx = \frac{1}{2w} \cdot \frac{2w^3}{3} = \frac{w^2}{3}$

and the standard deviation is just $w/\sqrt{3}$. Thus, in
these comparison simulations, $\sigma_c$ for the DOG was set
to $w_c/\sqrt{3}$ and $\sigma_s$ was set to $(w_c + w_s)/\sqrt{3} = 4w_c/\sqrt{3}$
(since the surround width was three times the center
width in the simulations).

The difference-of-boxes operator and the difference-of-Gaussians
operator perform similarly since they have
nearly  the  same  shape.  That  is,  since  they  are  con¬ 
structed  to  have  the  same  mean  and  same  variance 
(second  moment)  and  are  even  functions,  their  first 
and  third  moments  are  zero.  Thus,  the  first  moment  in 
which  they  differ  is  the  low  order  fourth  moment,  and 
they  perform  nearly  equivalently. 
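The matching rule can be checked numerically. The sketch below is illustrative only: `uniform_moment`, `matched_sigmas` and `gaussian_fourth_moment` are hypothetical helper names, and the surround ratio of 3 is taken from Appendix A. It confirms that a box of half-width $w$ has second moment $w^2/3$, and that a box and a Gaussian with equal variance first differ at the fourth moment ($w^4/5$ versus $3\sigma^4 = w^4/3$).

```python
import math

def uniform_moment(w, order, steps=20000):
    """Average of x**order over [-w, w], computed with the midpoint rule."""
    h = 2.0 * w / steps
    total = 0.0
    for i in range(steps):
        x = -w + (i + 0.5) * h
        total += x ** order
    return total * h / (2.0 * w)

def matched_sigmas(w_c, surround_ratio=3.0):
    """DOG standard deviations matched to box widths w_c and w_c + w_s."""
    w_s = surround_ratio * w_c
    sigma_c = w_c / math.sqrt(3.0)
    sigma_s = (w_c + w_s) / math.sqrt(3.0)
    return sigma_c, sigma_s

def gaussian_fourth_moment(sigma):
    """Fourth central moment of a zero-mean Gaussian."""
    return 3.0 * sigma ** 4
```

For instance, `matched_sigmas(1.5)` returns the pair $(w_c/\sqrt{3},\, 4w_c/\sqrt{3})$ for $w_c = 1.5$.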

References 

[1] D.H. Ballard and C.M. Brown. Computer Vision.
Prentice-Hall, Englewood Cliffs, NJ, 1982.

[2] I. Biederman. Human image understanding: Recent
research and a theory. Comp. Vision, Graphics,
and Image Proc., 32(1):29-73, 1985.

[3] T.O. Binford. Figure/ground: Segmentation and
aggregation. In O.J. Braddick and A.C. Sleigh,
editors, Physical and Biological Processing of Images,
Springer-Verlag, New York, 1983.

[4] T.O. Binford. Inferring surfaces from images.
Artificial Intelligence, 17:205-244, 1981.

[5] C. Blakemore and R. Over. Curvature detectors
in human vision? Perception, 3:3-7, 1974.

[6] C. Coombs, R. Dawes, and A. Tversky. Mathematical
Psychology: An elementary introduction.
Prentice-Hall, Englewood Cliffs, NJ, 1970.

[7] B. Julesz. Foundations of Cyclopean Perception.
The University of Chicago Press, London, 1971.

[8] D. Lowe. Perceptual Organization and Visual
Recognition. Kluwer Academic Publishers,
Boston, 1985.

[9] D. Lowe. Three-dimensional object recognition
from single two-dimensional images. Technical
Report 202, Courant Institute, New York University,
1986.

[10] D. Lowe. Visual recognition from spatial correspondence
and perceptual organization. In Proc.
Int. Joint Conf. on Art. Intell., pages 953-959,
1985.

[11] V.S. Nalwa and E. Pauchon. Edgel-aggregation
and edge-description. In Image Understanding
Workshop, 1985.

[12] B.N. Timney and C. Macdonald. Are curves detected
by 'curvature detectors'? Perception, 7:51-64,
1978.

[13] A. Treisman. Preattentive processing in vision.
Comp. Vision, Graphics, and Image Proc.,
31(2):156-177, 1985.

[14] W.R. Uttal, L.M. Bunnell, and S. Corwin. On the
detectability of straight lines in visual noise. Perception
and Psychophysics, 8(6):385-388, 1970.

[15] R. Vistnes. Detecting structure in random-dot
patterns. In Image Understanding Workshop,
1985.

[16] A. Watson. Probability summation over time. Vision
Res., 19:515-522, 1979.

[17] H.R. Wilson. Discrimination of contour curvature:
Data and theory. J. Opt. Soc. Am. A, 2(7):1191-1198,
1985.

[18] A.P. Witkin and J.M. Tenenbaum. On the role
of structure in vision. In O.J. Braddick and A.C.
Sleigh, editors, Physical and Biological Processing
of Images, Springer-Verlag, New York, 1983.

[19] S.W. Zucker. Computational and psychophysical
experiments in grouping: Early orientation selection.
In J. Beck, B. Hope, and A. Rosenfeld,
editors, Human and Machine Vision, Academic
Press, New York, 1982.

[20] S.W. Zucker. Cooperative grouping and early orientation
selection. In O.J. Braddick and A.C.
Sleigh, editors, Physical and Biological Processing
of Images, Springer-Verlag, New York, 1983.

[21] S.W. Zucker. The diversity of perceptual grouping.
Technical Report TR-85-1R, Comp. Vision
and Robotics Lab, McGill University, April 1985.

[22] S.W. Zucker. Early orientation selection: Tangent
fields and the dimensionality of their support.
Comp. Vision, Graphics, and Image Proc.,
32(1):74-103, 1985.

[23] S.W. Zucker, K.A. Stevens, and P. Sander. The
relation between proximity and brightness similarity
in dot patterns. Perception and Psychophysics,
34(6):513-522, 1983.


LEARNING  SHAPE  COMPUTATIONS 


John  (Yiannis)  Aloimonos 
David  Shulman 

Center  for  Automation  Research 
University  of  Maryland 
College Park, MD 20742


ABSTRACT 

Descriptions  of  physical  properties  of  surfaces  that 
we  see,  such  as  their  shape,  distance,  boundaries  and 
discontinuities,  must  be  recovered  from  the  image  data.  A 
large  part  of  research  in  computational  vision  aims  to 
understand  how  such  descriptions  can  be  obtained  from 
noisy  and  ambiguous  data.  We  view  these  problems  as 
ill-posed  problems  and  we  present  a  unified  theory  for  the 
computation  of  shape  from  several  cues,  such  as  shading, 
texture,  contour  or  motion.  We  study  how  a  connectionist 
network  (neural  network)  can  learn  all  the  parameters 
that  are  involved  in  the  computation  of  shape  from  any 
cue, and then how it can, through an iterative technique
that  converges  to  a  unique  value,  compute  the  solution  of 
any  "shape  from  X"  problem. 

1.  INTRODUCTION 

One  of  the  main  goals  of  computational  vision  is  to 
develop  image  understanding  systems  that  automatically 
construct  scene  descriptions  from  the  image  data  (input). 
Early vision consists of a set of processes that recover
physical properties of the visible three dimensional surfaces
from the two dimensional intensity arrays. Such
properties include shape, distance, reflectance and others.
On the other hand many cues are available and they are
used by different modules in the recovery process. Such
cues are texture, contours, shading, color, stereo and
motion. Much current research has analyzed processes of
early vision, such as shape from shading [16], shape from
texture [17, 18, 19], shape from motion (kinetic depth
effect) [20] and shape from contour [21, 22]. Several successful
algorithms have been proposed for the above problems
that worked well for the assumptions that they were
based  on.  In  general,  all  these  different  "shape  from  X” 
problems  have  been  studied  separately,  and  only  with 
respect  to  one  or  two  levels  of  a  visual  system,  as  defined 
by  Marr  [23] . 

We  show  here  that  all  the  problems  of  computing 
shape  from  shading,  texture,  contour  and  motion  may  be 
solved by the same neural network that iteratively converges
to a unique solution. The network solves the
different problems merely by changing the weights in the
connections  of  the  different  units.  In  the  course  of  our 
analysis  we  regularize  some  ill-posed  problems  that  are 
nonlinear  and  for  which  standard  regularization  tech¬ 
niques  [2-4]  are  not  applicable. 

2.  MOTIVATION 

All the problems of shape from shading, contour,
texture and motion have the characteristic that they are
ill-posed in the sense of Hadamard. A problem is well-posed
when its solution exists, is unique and depends
continuously on the initial data. Ill-posed problems fail to
satisfy one or more of these criteria. The main idea of
"solving" ill-posed problems, i.e. of restoring "well-posedness",
is to introduce suitable a priori knowledge
that will restrict the space of admissible solutions. A
priori knowledge can be exploited, for example, under the
form of constraints on the possible solutions. The regularization
of the ill-posed problem of finding $w$ (the
world) from the data $i$ (image) and the constraint
$Aw = i$ when $A$ is not invertible (ill-posed problem),
amounts to choosing a norm $\|\cdot\|$ and a functional $Pw$,
and then applying an optimization technique. For example,
we can find the $w$ that minimizes the function
$\|Aw - i\|^2 + \lambda \|Pw\|^2$, where $\lambda$ is the so-called regularization
parameter. This approach was initiated in computer
vision by Poggio and his colleagues [25]. For it to apply, $A$ should
be a linear operator, the norms quadratic and $P$ linear.
When the constraint is nonlinear, and it is nonlinear most
of the time, then finding a unique solution is not
guaranteed [28]. Finally, our learning scheme was
motivated by Poggio's work [37], where it was observed

that  regularizing  operators  might  be  learned  without  the 
need  for  explicit  variational  (smoothness)  conditions. 
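To make the standard (linear) regularization recipe concrete, here is a minimal sketch for the smallest ill-posed case: one linear constraint $a \cdot w = i$ on two unknowns $w = (w_1, w_2)$, which mirrors the "one constraint, two unknowns" situation that recurs in the shape problems. The choices $P = I$ (identity) and the function name `regularized_solution` are assumptions made for illustration only.

```python
def regularized_solution(a, i, lam):
    """Minimize (a.w - i)**2 + lam * |w|**2 over w = (w1, w2), with lam > 0.

    Setting the gradient to zero gives (a a^T + lam I) w = a i, whose
    solution is w = a * i / (|a|**2 + lam).
    """
    norm_sq = a[0] ** 2 + a[1] ** 2
    scale = i / (norm_sq + lam)
    return (a[0] * scale, a[1] * scale)
```

Without the $\lambda$ term every point on the line $a \cdot w = i$ is a solution; with it the minimizer is unique, and as $\lambda$ tends to zero it approaches the minimum-norm point on that line.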

The  organization  of  the  paper  is  as  follows:  The  next 
section  develops  the  computational  theory  of  shape  com¬ 
putation  from  contour,  texture,  motion  and  shading,  i.e. 
what  are  the  constraints  that  govern  the  particular  prob¬ 
lem.  In  other  words  we  develop  the  first  level  of  a  visual 
system  (as  proposed  by  Marr).  Section  4  develops  an  algo¬ 
rithm  for  shape  computation  and  studies  its  properties, 
i.e.  the  second  level  of  a  visual  system.  Finally,  Section  5 
discusses implementation of shape computation algorithms
(the third level) and develops a theory for learning
how to compute shape from different kinds of cues in
a  neural  net. 

3.  COMPUTATIONAL  THEORY  OF  SHAPE 

Here  we  study  the  physical  constraints  that  are 
involved  in  the  computation  of  shape  from  different  cues. 
Each  case  is  studied  separately. 

3.1. Computational Theory of Shape from Shading

This  theory  has  been  developed  by  Horn  and  his  col¬ 
leagues  [16].  The  treatment  is  done  under  orthography 
(actually,  in  the  perspective  case  the  same  laws  apply,  as 
noted  in  [26]).  According  to  this  theory,  the  intensity  at  a 
point  ( x,y )  of  an  image  is  given  by  the  formula 

$I(x,y) = \rho\, \vec{s} \cdot \vec{n}$

where $\rho$ is a constant depending on the material of the
surface in view, $\vec{s}$ is the direction of the point light
source and $\vec{n}$ the surface normal at the surface point
whose image is the point $(x,y)$. If we are trying to compute
shape, then we really are after two parameters $(p,q)$
at every image point $(x,y)$, where $(p,q)$ is the gradient of
the surface normal vector at the surface point whose
image is $(x,y)$. The above equation, when expressed in
terms of the gradient $(p,q)$, becomes

$I(x,y) = \dfrac{\rho\,(p p_s + q q_s + 1)}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_s^2 + q_s^2}}$

where $\vec{s} = (p_s, q_s, -1)$. It is obvious that at every point of
our  image  we  have  two  unknowns  and  only  one  con¬ 
straint. 
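The ambiguity is easy to see numerically. The sketch below evaluates the gradient form of the image intensity equation for a Lambertian surface; the function name `lambertian_intensity` and the default material constant of 1 are assumptions of this illustration.

```python
import math

def lambertian_intensity(p, q, p_s, q_s, rho=1.0):
    """Image intensity for surface gradient (p, q) and light direction (p_s, q_s, -1)."""
    numerator = p * p_s + q * q_s + 1.0
    denominator = math.sqrt(1.0 + p * p + q * q) * math.sqrt(1.0 + p_s * p_s + q_s * q_s)
    return rho * numerator / denominator
```

With the light source along the viewing direction (`p_s = q_s = 0`), the gradients (1, 0) and (0, 1) produce exactly the same intensity, so a single intensity measurement cannot determine both unknowns.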

3.2.  Computational  Theory  of  Shape  from 
Motion 

Here  we  investigate  the  constraints  imposed  by 
motion.  This  problem  has  been  extensively  studied  in  the 
field  of  structure  from  motion  or  kinetic  depth  effect,  for 
both  orthography  and  perspective  projection  and  for 
discrete  or  continuous  motion.  Here  we  use  orthography 
and  discrete  motion.  Also,  we  consider  the  object  in  view 
as  consisting  of  features,  i.e.  points  of  interest;  so  we 
speak  of  the  object  as  a  cloud  of  points  that  moves. 
Furthermore,  only  rigid  motion  is  considered.  In  the 
pioneering work of Ullman it was proved that three
orthographic  projections  of  four  noncoplanar  points  admit 
one  solution  for  the  structure  and  motion  of  the  points.  It 
was  stated  in  [23]  that  two  orthographic  projections  of 
any  number  of  points  admit  infinitely  many  interpreta¬ 
tions  for  the  structure  of  the  points.  This  was  formally 
proved  in  [27],  correcting  a  previous  result.  Here,  we 
study  the  constraint  between  displacements  of  points  and 
local  shape.  It  turns  out  that  this  constraint  is  of  the 
same  form  as  in  the  shape  from  shading  case,  i.e.  a  conic 
in  the  gradient  (p,q).  For  this  we  need  the  following 
lemma: 

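Lemma:

Given two distinct orthographic projections of three points
in a rigid configuration, the gradient $(p,q)$ of the plane
that the three points define (with respect to the coordinate
system of the first frame) lies on a conic section in gradient
space. The coefficients of this conic section depend
entirely on the interframe displacements of the points.

Proof:

Let the three points in space be $O, A, B$ in their first
positions and $O', A', B'$ in their second positions and
their projections in the two frames be $O_1, A_1, B_1$ and
$O_2, A_2, B_2$ respectively. Let also the gradient of the
plane $OAB$ be $\vec{G} = (p,q)$. Furthermore, let

$\overrightarrow{O_1A_1} = \vec{\alpha}_1 = (x_1, y_1)$  (1)

$\overrightarrow{O_1B_1} = \vec{\beta}_1 = (c_1, d_1)$  (2)

$\overrightarrow{O_2A_2} = \vec{\alpha}_2 = (x_2, y_2)$  (3)

$\overrightarrow{O_2B_2} = \vec{\beta}_2 = (c_2, d_2)$  (4)

Considering the geometry of the first projection
($OAB$ to $O_1A_1B_1$), we have

$\overrightarrow{OA} = (x_1, y_1, \vec{G} \cdot \vec{\alpha}_1)$  (5)

$\overrightarrow{OB} = (c_1, d_1, \vec{G} \cdot \vec{\beta}_1)$  (6)

Similarly, considering the second projection ($O'A'B'$ to
$O_2A_2B_2$), we get

$\overrightarrow{O'A'} = (x_2, y_2, \lambda)$  (7)

$\overrightarrow{O'B'} = (c_2, d_2, \mu)$  (8)

where $\lambda$ and $\mu$ are to be determined.

But, because of the rigid motion, the vectors $\overrightarrow{OA}$ and
$\overrightarrow{O'A'}$ have the same length. The same holds for the vectors
$\overrightarrow{OB}$ and $\overrightarrow{O'B'}$. From these requirements we get

$\lambda = \pm\sqrt{\vec{\alpha}_1^{\,2} + (\vec{G} \cdot \vec{\alpha}_1)^2 - \vec{\alpha}_2^{\,2}}$  (9)

$\mu = \pm\sqrt{\vec{\beta}_1^{\,2} + (\vec{G} \cdot \vec{\beta}_1)^2 - \vec{\beta}_2^{\,2}}$  (10)

Finally, again because of the rigidity, the angles between
the vectors $\overrightarrow{OA}, \overrightarrow{OB}$ and $\overrightarrow{O'A'}, \overrightarrow{O'B'}$ are the same. From
this, we get:

$\overrightarrow{OA} \cdot \overrightarrow{OB} = \overrightarrow{O'A'} \cdot \overrightarrow{O'B'}$  (11)

where $\cdot$ denotes the dot product operation.

Substituting in equation (11) from equations (5), (6),
(7), (8), (9), (10), we get

$\vec{\alpha}_1 \cdot \vec{\beta}_1 + (\vec{G} \cdot \vec{\alpha}_1)(\vec{G} \cdot \vec{\beta}_1) = \vec{\alpha}_2 \cdot \vec{\beta}_2 \pm \lambda\mu$

and substituting the values for $\lambda$ and $\mu$ and squaring
appropriately, we get

$\left[\vec{\alpha}_1 \cdot \vec{\beta}_1 - \vec{\alpha}_2 \cdot \vec{\beta}_2 + (\vec{G} \cdot \vec{\alpha}_1)(\vec{G} \cdot \vec{\beta}_1)\right]^2 = \left[\vec{\alpha}_1^{\,2} - \vec{\alpha}_2^{\,2} + (\vec{G} \cdot \vec{\alpha}_1)^2\right]\left[\vec{\beta}_1^{\,2} - \vec{\beta}_2^{\,2} + (\vec{G} \cdot \vec{\beta}_1)^2\right]$  (12)

Given that

$\vec{G} \cdot \vec{\alpha}_1 = p x_1 + q y_1$

$\vec{G} \cdot \vec{\beta}_1 = p c_1 + q d_1$

the above equation (12) is of the form

$A p^2 + B q^2 + \Gamma pq + \Delta = 0$

where the coefficients $A, B, \Gamma, \Delta$ depend on the image
vectors $\vec{\alpha}_1$, $\vec{\alpha}_2$, $\vec{\beta}_1$ and $\vec{\beta}_2$ (q.e.d.).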
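The lemma can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' implementation: it lifts two image vectors onto a plane $z = px + qy$, applies a rigid rotation, reprojects orthographically, and evaluates the conic at the true gradient. The coefficient expressions come from expanding the squared rigidity condition (the fourth-degree terms cancel); all function names are hypothetical.

```python
import math

def rotate(v, axis, angle):
    """Rotate 3-vector v about the unit vector `axis` by `angle` radians (Rodrigues' formula)."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, v))
    cross = (axis[1] * v[2] - axis[2] * v[1],
             axis[2] * v[0] - axis[0] * v[2],
             axis[0] * v[1] - axis[1] * v[0])
    return tuple(v[i] * c + cross[i] * s + axis[i] * dot * (1.0 - c) for i in range(3))

def conic_coefficients(a1, b1, a2, b2):
    """Coefficients (A, B, Gamma, Delta) of A p^2 + B q^2 + Gamma p q + Delta = 0."""
    x1, y1 = a1
    c1, d1 = b1
    da = (x1 * x1 + y1 * y1) - (a2[0] ** 2 + a2[1] ** 2)        # |a1|^2 - |a2|^2
    db = (c1 * c1 + d1 * d1) - (b2[0] ** 2 + b2[1] ** 2)        # |b1|^2 - |b2|^2
    dab = (x1 * c1 + y1 * d1) - (a2[0] * b2[0] + a2[1] * b2[1])  # a1.b1 - a2.b2
    A = x1 * x1 * db + c1 * c1 * da - 2.0 * x1 * c1 * dab
    B = y1 * y1 * db + d1 * d1 * da - 2.0 * y1 * d1 * dab
    Gamma = 2.0 * (x1 * y1 * db + c1 * d1 * da - (x1 * d1 + y1 * c1) * dab)
    Delta = da * db - dab * dab
    return A, B, Gamma, Delta

def conic_residual(p, q, coeffs):
    """Left-hand side of the conic evaluated at gradient (p, q)."""
    A, B, Gamma, Delta = coeffs
    return A * p * p + B * q * q + Gamma * p * q + Delta

def synthetic_views(p, q, a1, b1, axis, angle):
    """Lift a1, b1 onto the plane z = p x + q y, rotate rigidly, and reproject."""
    OA = (a1[0], a1[1], p * a1[0] + q * a1[1])
    OB = (b1[0], b1[1], p * b1[0] + q * b1[1])
    OA2 = rotate(OA, axis, angle)
    OB2 = rotate(OB, axis, angle)
    return OA2[:2], OB2[:2]
```

Calling `synthetic_views` with a chosen gradient and rotation, then `conic_residual` at that same gradient, should give a value near zero up to rounding. Translation is omitted because only difference vectors between the points enter the constraint.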

We  can  now  easily  establish  the  relationship  between 
retinal  displacements  and  local  shape,  by  applying  the 
previous  lemma  for  the  points  (x,y),  (x  +  dx,y), 

(x,y  +  dy).  Then,  assuming  that  the  surface  is  locally 
planar  and  that  the  plane  is  defined  by  the  three  points 
in  the  world  whose  images  are  the  points 

( x,y),(x  +  dx,y),(x,y  +  dy),  we  can  derive  a  constraint 
between  local  shape  and  retinal  motion. 

Consider a moving surface $Z = Z(x,y)$ and let
$(\Delta u(x,y), \Delta v(x,y))$ be the discrete displacement field
for two time instants $t_1$ and $t_2$ with $t_1 < t_2$, i.e. if an
image point is at the position $(x,y)$ at time $t_1$, then at
time $t_2$ it will be at the position $(x + \Delta u(x,y),
y + \Delta v(x,y))$. Then, assuming that the surface is
locally (differentially) planar, i.e. the points $(x,y)$,
$(x + dx, y)$, $(x, y + dy)$ define a plane, we can prove that
the gradient $(p,q)$ at a surface point whose projection on
the image plane is the point $(x,y)$, satisfies the following
conic constraint:

$k_1 p^2 + k_2 q^2 - 2 k_3 pq + k_4 = 0$

with

$k_1 = (\vec{\alpha}_1^{\,2} - \vec{\alpha}_2^{\,2})\, dx^2$

$k_2 = (\vec{\beta}_1^{\,2} - \vec{\beta}_2^{\,2})\, dy^2$

$k_3 = (\vec{\alpha}_1 \cdot \vec{\beta}_1 - \vec{\alpha}_2 \cdot \vec{\beta}_2)\, dx\, dy$

$k_4 = (\vec{\alpha}_1^{\,2} - \vec{\alpha}_2^{\,2})(\vec{\beta}_1^{\,2} - \vec{\beta}_2^{\,2}) - (\vec{\alpha}_1 \cdot \vec{\beta}_1 - \vec{\alpha}_2 \cdot \vec{\beta}_2)^2,$

where $\vec{\alpha}_1 = (0, dy)$, $\vec{\beta}_1 = (dx, 0)$,

$\vec{\alpha}_2 = (\Delta u(x, y + dy) - \Delta u(x,y),\; dy + \Delta v(x, y + dy) - \Delta v(x,y))$

and $\vec{\beta}_2 = (dx + \Delta u(x + dx, y) - \Delta u(x,y),\; \Delta v(x + dx, y) - \Delta v(x,y))$.

The proof of the above equation is immediate from
the application of the lemma to the points $(x,y)$,
$(x + dx, y)$ and $(x, y + dy)$. It has to be emphasized that
the coefficients $k_i$, $i = 1, \ldots, 4$ are not constant; they
are functions of $(x,y)$ and so we write $k_i(x,y)$ instead of
$k_i$.

3.3. Computational Theory of Shape from Patterns

Imagine that we are viewing a surface covered with
repeated elements (texels or patterns). Then, the problem
of shape from patterns is solving for the shape of the surface
in view from the apparent distortion of the patterns.
Several researchers have worked on this problem.
Research reported in [31, 32, 33] made the assumption
that the patterns were all identical and symmetric, and
then the problem could be approached through skewed
symmetry constraints. Research in [34] dealt with the
case of planar surfaces and did not assume any pattern
symmetry. Finally, research in [35] attempted the solution
of this problem under the assumption that all patterns
should be of the same area (which is true when all
the patterns are the same). Symmetry was not a requirement.
Furthermore, every texel is assumed to lie on a
planar surface, and this means that the size of the texels
on the surface has to be small compared with the change
of surface orientation. Also, in this work the scene was
assumed to be projected onto the retina under an approximation
of perspective projection, called paraperspective.
Paraperspective projection is analyzed in [27] and its
approximation error is calculated. For the purposes of
this section, we will only state the constraint relating the
area of a world texel to the area of its image. Let $S_I$ be
the area of a texel in the image plane. Let $S_W$ be the area
of the texel on the world surface. Let $(p,q)$ be the gradient of
the plane on which the world texel lies, and $(A,B)$ the
center of mass of the image texel. Then the following
relation holds

$S_I = \dfrac{S_W}{\beta^2} \cdot \dfrac{1 - Ap - Bq}{\sqrt{1 + p^2 + q^2}}$

The quantity $S_W/\beta^2$ is called textural albedo, and in [27]
methods are presented for its computation. For our purposes
we will assume that the textural albedo is known
and we will symbolize it by $\lambda$. So, in the case of shape
from pattern, the constraint that governs the problem is

$S_I = \lambda\, \dfrac{1 - Ap - Bq}{\sqrt{1 + p^2 + q^2}}$

It is important to note that the constraint in this case is
again a conic section in gradient space.


3.4. Computational Theory of Shape from Texture

This  case  is  very  similar  to  the  previous  one.  Here  we 
only consider a dot texture, i.e. we assume that the surface
in view is covered by dots. Shape cannot be
recovered  in  this  case  from  a  monocular  static  view  of  the 
surface,  unless  some  assumptions  about  the  texture  are 
made.  The  assumption  of  uniform  density  (or  slight  varia¬ 
tions  of  it)  has  been  shown  to  be  successful  in  past 
research,  for  the  case  of  planar  surfaces.  Here  we  assume 
a  nonplanar  surface,  and  we  make  the  assumption  that  a 
unit  area  on  the  surface  in  view  contains  the  same 
number  of  texels  (points).  In  this  case,  the  constraint 
relating  local  shape  and  density  of  the  texture  can  easily 
be  established  and  it  is  of  a  form  similar  to  the  previous 
case,  but  instead  of  areas,  we  have  texture  densities.  We 
won’t  analyze  this  case  in  detail,  since  it  follows  easily 
from  the  previous  section. 
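The area constraint of Section 3.3, which this section reuses with texture densities in place of areas, can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical and the textural albedo is passed in as a known number, as the text assumes. The second function shows the squared form, which is a conic in the gradient.

```python
import math

def image_texel_area(p, q, A, B, albedo):
    """Predicted image area of a texel on a plane with gradient (p, q).

    (A, B) is the center of mass of the image texel and `albedo` the
    textural albedo, assumed known.
    """
    return albedo * (1.0 - A * p - B * q) / math.sqrt(1.0 + p * p + q * q)

def conic_form_residual(p, q, A, B, albedo, s_image):
    """Squared form of the constraint; zero when (p, q) explains s_image."""
    lhs = s_image ** 2 * (1.0 + p * p + q * q)
    rhs = albedo ** 2 * (1.0 - A * p - B * q) ** 2
    return lhs - rhs
```

For a frontal plane (`p = q = 0`) the predicted area equals the albedo itself; any other gradient changes the area by foreshortening and by the texel's position in the image.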

3.5.  Summary 

Up to this point we have analyzed the computational
theory behind the processes of shape from shading,
motion, patterns and texture. In all cases, it was shown
that the constraint is of the form

$A(x,y)p^2 + B(x,y)q^2 + C(x,y)pq + D(x,y)p + E(x,y)q + F(x,y) = 0$  (13)

where  A,B,C,D,E,F  are  functions  of  the  position  in  the 
image  and  depend  on  the  particular  physical  parameters 
(data).  It  has  to  be  noted  that  the  approximations  made 
(paraperspective  projection,  orthographic  projection)  are 
not  essential  for  our  theory.  If  perspective  projection  were 
used  then  a  more  complicated  constraint  (still  nonlinear 
of  higher  degree)  would  appear.  But  as  will  become  evi¬ 
dent  later,  the  form  of  the  constraint  is  not  very  essen¬ 
tial,  as  long  as  it  satisfies  some  weak  constraint  that  will 
be  described  later. 

3.6. Shape Representation: Stereographic Projection

Surface orientation is quantified by the surface normal,
a unit vector in $R^3$. Let $(l,m,n)$ be a vector denoting
the direction of a surface normal. Then this direction
is defined by $(p,q) = \left(-\frac{l}{n}, -\frac{m}{n}\right)$. This quantity is
called the gradient of the normal and the space of all the
possible gradients defines the so-called gradient space.
Gradient  space  is  unbounded.  But  the  gradient  is  not  the 
only  representation  for  surface  orientation.  Another 
representation  that  results  in  a  bounded  space  is  stereo- 
graphic  coordinates.  A  surface  normal  can  be  represented 
by  a  point  on  a  unit  sphere,  called  the  Gaussian  sphere. 
The  part  of  the  surface  facing  us  corresponds  to  one  hem¬ 
isphere,  let  us  define  it  to  be  the  northern  hemisphere, 
while  points  on  the  occluding  boundaries  correspond  to 
the  points  on  the  equator.  A  point  n  on  the  Gaussian 
sphere  corresponds  to  a  unique  gradient  ( p.q )  and  vice 
versa.  This  can  be  seen  geometrically  in  the  following 
way.  Consider  the  projection  of  the  Gaussian  sphere  by 
rays  from  its  center  onto  a  plane  tangent  to  the  north 
pole.  This  plane  represents  the  gradient  space  (p,q). 
This  projection  is  called  the  gnomonic  projection  and  it 
has  the  property  that  great  circles  are  mapped  into  lines. 
The  north  pole  is  mapped  onto  the  origin  of  the  gradient 
space,  the  northern  hemisphere  is  mapped  into  the  whole 
p  -  q  plane  and  points  on  the  equator  end  up  at  infinity. 
Points  on  the  lower  hemisphere  are  not  projected  onto 
the  p  -  q  plane.  With  the  gradient  space  formalism,  the 
equator  maps  into  infinity.  As  a  result,  occluding  boun¬ 
dary  information  cannot  be  expressed  in  gradient  space. 
One  solution  to  this  problem  is  the  use  of  stereographic 
projection.  We  can  think  of  this  projection  in  geometric 
terms  also  as  a  projection  of  the  Gaussian  sphere  onto  a 
plane  tangent  to  the  north  pole.  However,  this  time  the 
center  of  projection  is  the  south  pole,  not  the  center  of 
the  sphere.  We  label  the  axes  of  the  stereographic  plane 
as  f,g.  It  can  be  shown  that  the  relation  between  the 
stereographic  space  coordinates  f,g  and  the  gradient 
space  coordinates  is  given  by 

$f = 2p\left[\sqrt{1 + p^2 + q^2} - 1\right]/(p^2 + q^2)$

$g = 2q\left[\sqrt{1 + p^2 + q^2} - 1\right]/(p^2 + q^2)$

$p = \dfrac{4f}{4 - f^2 - g^2}$

$q = \dfrac{4g}{4 - f^2 - g^2}$.

This mapping is conformal and the whole northern hemisphere
is mapped onto a closed disc of radius 2 on the
$f$-$g$ plane. The orientations of occluding boundaries
correspond to points on the circumference of that disc.
The advantage of the stereographic projection is that it
maps the whole northern hemisphere (visible orientations)
onto a bounded space, i.e. a disc of radius 2 in the stereographic
plane with center at the origin (the point where
the  stereographic  plane  is  tangent  to  the  sphere,  i.e.  the 
north  pole),  and  occluding  boundary  information  can  be 
easily  expressed  (the  points  of  the  circumference  of  the 
disc). 
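The two coordinate systems are easy to convert between. The sketch below implements the forward and inverse maps between gradient space and stereographic coordinates (unit Gaussian sphere, projection from the south pole onto the plane tangent at the north pole); the function names are hypothetical, and the handling of the origin and of the boundary circle is a choice made for this illustration.

```python
import math

def gradient_to_stereographic(p, q):
    """Map a gradient (p, q) to stereographic coordinates (f, g)."""
    r2 = p * p + q * q
    if r2 == 0.0:
        return 0.0, 0.0                       # north pole maps to the origin
    scale = 2.0 * (math.sqrt(1.0 + r2) - 1.0) / r2
    return p * scale, q * scale

def stereographic_to_gradient(f, g):
    """Map stereographic coordinates (f, g) inside the radius-2 disc to (p, q)."""
    denom = 4.0 - f * f - g * g
    if denom <= 0.0:
        # On the circle f^2 + g^2 = 4 the gradient is infinite (occluding boundary).
        raise ValueError("point is on or outside the occluding-boundary circle")
    return 4.0 * f / denom, 4.0 * g / denom
```

Very steep gradients, which run off to infinity in gradient space, land just inside the circle of radius 2, which is what makes occluding-boundary orientations representable.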

4.  ALGORITHMIC  THEORY 
OF  SHAPE  COMPUTATION 

It  is  by  now  clear  that  in  all  the  above  cases  (shad¬ 
ing,  texture,  motion  and  pattern)  the  problem  cannot  be 
solved  uniquely,  unless  an  additional  assumption  is 
imposed.  In  this  section  we  show  that  if  we  regularize 
these  problems,  i.e.  if  we  impose  an  additional  constraint 
(which  is  physically  plausible),  then  we  can  uniquely  solve 
these  problems.  If  we  use  the  equations  that  give  the 
stereographic  coordinates  in  terms  of  the  gradient,  we  can 
substitute $p, q$ in terms of $f$ and $g$ and, assuming that the
surface does not contain points whose orientation is equal
to the orientation at the boundaries, we get a relation of
the form

$L(f, g, x, y) = 0$  (14)

with $L$ a polynomial in $f, g$ whose coefficients depend on
the image data. It has to be noted that in all the problems
of shape from shading, texture, patterns or motion,
the constraint will be of the above form.

Clearly, all the above-mentioned problems of finding
structure $f, g$ from equation (14) are ill-posed problems in
the sense of Hadamard [36], since the solution is not
unique. To regularize the problem and make it well-posed,
we should introduce some more constraints on the
problem. Specific regularization methods for solving ill-posed
problems have been developed [16, 24]. Following
the paradigm of regularization theory, we require that the
surface in view be smooth and we wish to minimize the
quantity [37]

$e = \iint \left\{ \left[ f_x^2 + f_y^2 + g_x^2 + g_y^2 \right] + \lambda L^2 \right\} dx\, dy$  (15)

where $\lambda$ is a regularization constant.

In  other  words,  to  solve  this  ill-posed  problem  (i.e.  to 
make it well-posed), we should restrict the class of admissible
solutions by introducing suitable a priori knowledge.
This a priori knowledge can be exploited, for example,
under the form of variational principles that impose constraints
on the possible solutions. By choosing to minimize
$e$ in (15) we wish to find a solution that best satisfies
the  constraint  L  =  0  and  that  is  smooth,  where  the  rela¬ 
tive  degree  of  smoothness  is  measured  by  the  first  term  in 
(15).  But  this  problem,  even  though  it  fits  in  the  regulari¬ 
zation  paradigm,  cannot  be  trivially  solved,  because  of 
the  nonlinearity  involved.  Here  we  present  an  algorithm 
that  obtains  a  unique  solution. 

We  require  that  the  solution  be  a  surface  that  best 
satisfies  the  motion  constraint  (eq.  14)  and  at  the  same 
time  is  as  smooth  as  possible,  according  to  the  stabilizing 
functional  [24]  introduced  in  (15).  This  fits  exactly  into 
the  regularization  paradigm  in  vision,  which  originated  in 
[25]  and  has  received  attention  lately. 

The question that arises is whether or not there exists a unique
surface that minimizes $\epsilon$, and if so, to give an algorithm
that finds this unique solution. In the sequel we investigate this
question. Our analysis is done for the discrete case, i.e. by taking
into account the discrete nature of images.


The function $\epsilon$ of equation (15) is defined on a compact
subset $K$ of $R^{2n}$, with $n = k^2$, and it is continuous with
respect to $f_{ij}$ and $g_{ij}$. Therefore, there exists a solution
to the minimization problem. Furthermore, the solution that minimizes
$\epsilon$ is the solution of the system

$$\frac{\partial \epsilon}{\partial f_{ij}} = \frac{\partial \epsilon}{\partial g_{ij}} = 0. \qquad (17)$$


Equations  (17)  become 


where 


/»  {  .  t  \  +  /.j+lUij  +  l)  +  fi-iAti- lj)  +  Ui-AliJ-l) 

B.j\9ij) - - - 

Equations (18) can be written as

$$\Phi \xi = -\lambda m^2 \phi(\xi) \qquad (19)$$

where

$$\xi = [f_{1,1}, \ldots, f_{1,k}, \ldots, f_{k,k}, g_{1,1}, \ldots, g_{k,k}]^T$$


4.1.  Finding  the  Unique  Solution 

Here we draw on widely known techniques from the area of partial
differential equations, which have been applied to some computer
vision research [29]. Consider the constraint equation
$L(f, g, x, y) = 0$, $(x, y) \in D$, where $D$ is the unit square
region in the $x$-$y$ plane with mesh size $m$, and discretize
$\epsilon$ by using difference operators instead of differential
operators and summations instead of integrals. We assume that the
boundary of the object (image) is a square. This is done for
simplicity here and without loss of generality. Generalization to any
kind of boundary is possible but tedious, and will be reported
elsewhere. Let $n = k^2$, where $k + 1 = 1/m$. The desired surface is
the one that minimizes


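$$\epsilon = \sum_{i,j} \left( s_{ij} + \lambda r_{ij} \right) \qquad (16)$$

where

$$s_{ij} = \frac{1}{m^2} \left[ (f_{i+1,j} - f_{ij})^2 + (f_{i,j+1} - f_{ij})^2 + (g_{i+1,j} - g_{ij})^2 + (g_{i,j+1} - g_{ij})^2 \right]$$

$$r_{ij} = \left[ L(f_{ij}, g_{ij}, i, j) \right]^2$$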

and where $f_{ij}$, $g_{ij}$ represent the surface orientation at the
regular grid point $(im, jm)$. This minimization is subject to
boundary conditions, i.e. $f_{ij}$ and $g_{ij}$ are known if
$(im, jm)$ belongs to the boundary. We assume that the surface normal
at a boundary point $(i, j)$ is parallel to the image plane (i.e.
$f_{ij}^2 + g_{ij}^2 = 4$) (occluding boundary).

If perspective projection is used, then the occluding boundary
surface normals are not parallel to the image plane, i.e.
$f^2 + g^2 \neq 4$, but they are computable. The boundary surface
normals only have to be known; they do not have to have any
particular value.
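The discrete functional (16) can be evaluated directly on a grid. The following is a minimal sketch, not the authors' implementation: the array layout (a $(k+2) \times (k+2)$ grid whose outer ring stores the known boundary values), the $1/m^2$ scaling of the smoothness term, the summation range, and the callable `L` standing in for the problem-specific constraint are all assumptions made for illustration.

```python
import numpy as np

def discrete_energy(f, g, L, lam, m):
    """Sum of s_ij + lam * r_ij over the k x k interior grid points.

    f, g : (k+2) x (k+2) arrays; the outer ring holds boundary values.
    L    : hypothetical constraint function L(f_ij, g_ij, i, j).
    lam  : regularization constant; m : mesh size.
    """
    k = f.shape[0] - 2
    total = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            # Forward differences approximate the first derivatives.
            s_ij = ((f[i + 1, j] - f[i, j]) ** 2
                    + (f[i, j + 1] - f[i, j]) ** 2
                    + (g[i + 1, j] - g[i, j]) ** 2
                    + (g[i, j + 1] - g[i, j]) ** 2) / m ** 2
            r_ij = L(f[i, j], g[i, j], i, j) ** 2
            total += s_ij + lam * r_ij
    return total
```

A constant field has zero smoothness cost, so its energy is determined entirely by how well it satisfies the constraint.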


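In equation (19),

$$\phi(\xi) = \left[ \ldots, \{L(f_{ij}, g_{ij}, i, j)\} \frac{\partial L(f_{ij}, g_{ij}, i, j)}{\partial f}, \ldots, \{L(f_{ij}, g_{ij}, i, j)\} \frac{\partial L(f_{ij}, g_{ij}, i, j)}{\partial g}, \ldots \right]^T$$

and

$$\Phi = \begin{bmatrix} A & 0 \\ 0 & A \end{bmatrix}$$

where $A \in R^{n \times n}$ and

$$A = \begin{bmatrix} B & -I & & \\ -I & B & -I & \\ & \ddots & \ddots & \ddots \\ & & -I & B \end{bmatrix}$$

and

$$B = \begin{bmatrix} 4 & -1 & & \\ -1 & 4 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 4 \end{bmatrix} \in R^{k \times k}$$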
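As an illustration of this block structure, the matrix $\Phi$ can be assembled with Kronecker products. This is a sketch under the assumption that the off-diagonal entries of $A$ and $B$ are all $-I$ and $-1$ respectively (the standard five-point Laplacian pattern); the function name `build_phi` is ours.

```python
import numpy as np

def build_phi(k):
    """Return Phi = diag(A, A) for a k x k interior grid."""
    # T is the 1-D second-difference matrix tridiag(-1, 2, -1).
    T = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    I = np.eye(k)
    # kron(I, T) + kron(T, I) has diagonal blocks T + 2I = tridiag(-1, 4, -1)
    # and off-diagonal blocks -I, i.e. the block tridiagonal matrix A.
    A = np.kron(I, T) + np.kron(T, I)
    return np.kron(np.eye(2), A)
```

With $m = 1/(k+1)$, the 2-norm of the inverse of this matrix equals $1 / (8 \sin^2(\pi m / 2))$, which can be checked numerically for small $k$.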


Equation (19) is nothing but equations (18) written in a compact
form. Notice that equation (19) is the necessary condition for the
solution that minimizes (16). We will now prove that equation (19)
has a unique solution. For this we will need the fact that the
functions $\{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial f}$
and $\{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial g}$ are
Lipschitz with respect to $f$ and $g$.

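Let $\xi_1$ and $\xi_2$ be two solutions of equation (19), with
$\xi_1 \neq \xi_2$. Then

$$\Phi \xi_1 = -\lambda m^2 \phi(\xi_1) \qquad (20)$$

$$\Phi \xi_2 = -\lambda m^2 \phi(\xi_2). \qquad (21)$$

But $\Phi$ is invertible [29]. So, (20) and (21) become:

$$\xi_1 = -\lambda m^2 \Phi^{-1} \phi(\xi_1) \qquad (22)$$

$$\xi_2 = -\lambda m^2 \Phi^{-1} \phi(\xi_2). \qquad (23)$$

From (22) and (23) we get

$$\xi_1 - \xi_2 = -\lambda m^2 \Phi^{-1} \left( \phi(\xi_1) - \phi(\xi_2) \right)$$

or

$$\|\xi_1 - \xi_2\|_2 \le \lambda m^2 \|\Phi^{-1}\|_2 \, \|\phi(\xi_1) - \phi(\xi_2)\|_2 \qquad (24)$$

But

$$\|\Phi^{-1}\|_2 = \left[ 2\pi^2 m^2 \left( 1 - \frac{\pi^2 m^2}{24} \right)^2 \right]^{-1} \qquad (25)$$

from [30].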
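The value in (25) can be motivated by a short calculation. The following sketch is our own and assumes that $A$ is the standard five-point difference matrix (4 on the diagonal, $-1$ for each grid neighbor) with $m = 1/(k+1)$; we write $\sigma$ for eigenvalues to avoid a clash with the regularization constant $\lambda$.

$$\sigma_{pq}(A) = 4 - 2\cos(p \pi m) - 2\cos(q \pi m), \qquad p, q = 1, \ldots, k$$

$$\sigma_{\min}(\Phi) = \sigma_{11}(A) = 4 - 4\cos(\pi m) = 8 \sin^2\!\left( \frac{\pi m}{2} \right)$$

$$\sin x = x \left( 1 - \frac{x^2}{6} + O(x^4) \right) \;\Longrightarrow\; 8 \sin^2\!\left( \frac{\pi m}{2} \right) \approx 2 \pi^2 m^2 \left( 1 - \frac{\pi^2 m^2}{24} \right)^2$$

Since $\Phi$ is symmetric and positive definite, $\|\Phi^{-1}\|_2 = 1 / \sigma_{\min}(\Phi)$, which gives the expression in (25) up to higher-order terms in $m$.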

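Also, since $\{L(f, g, i, j)\} \frac{\partial L}{\partial f}$ and
$\{L(f, g, i, j)\} \frac{\partial L}{\partial g}$ are Lipschitz with
respect to $f$ and $g$, we have

$$\left| \{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial f} - \{L(f', g', i, j)\} \frac{\partial L(f', g', i, j)}{\partial f} \right| \le c_{ij} \left\{ (f - f')^2 + (g - g')^2 \right\}^{1/2} \qquad (26)$$

$$\left| \{L(f, g, i, j)\} \frac{\partial L(f, g, i, j)}{\partial g} - \{L(f', g', i, j)\} \frac{\partial L(f', g', i, j)}{\partial g} \right| \le d_{ij} \left\{ (f - f')^2 + (g - g')^2 \right\}^{1/2} \qquad (27)$$

and

$$\mu = \max \{ c_{ij}, d_{ij} \} \qquad (28)$$

From (26), (27), and (28) we get

$$\|\phi(\xi_1) - \phi(\xi_2)\|_2 \le \mu \, \|\xi_1 - \xi_2\|_2 \qquad (29)$$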

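Equation (24) from (25) and (29) becomes

$$\|\xi_1 - \xi_2\|_2 \le \lambda m^2 \left[ 2\pi^2 m^2 \left( 1 - \frac{\pi^2 m^2}{24} \right)^2 \right]^{-1} \mu \, \|\xi_1 - \xi_2\|_2 \qquad (30)$$

If we choose $\lambda$ such that

$$\lambda m^2 \left[ 2\pi^2 m^2 \left( 1 - \frac{\pi^2 m^2}{24} \right)^2 \right]^{-1} \mu < 1,$$

then equation (30) leads to a contradiction. So $\xi_1 = \xi_2$ and
equation (19) has a unique solution. Furthermore, the sequence
defined by

$$\xi^{(a+1)} = -\lambda m^2 \Phi^{-1} \phi(\xi^{(a)}), \qquad a = 0, 1, 2, \ldots$$

converges to the unique solution of equation (19). The proof proceeds
as previously, by proving that if $\xi$ is the solution, then
$\|\xi^{(a+1)} - \xi\|_2 \le q \, \|\xi^{(a)} - \xi\|_2$ where $q < 1$.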
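The iteration above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the constraint $L(f, g, i, j) = f - c_{ij}$ is a hypothetical linear stand-in (so that $L\,\partial L/\partial f = f - c_{ij}$, $L\,\partial L/\partial g = 0$ and $\mu = 1$), boundary values are taken to be zero, and the names `build_phi`, `fixed_point` and `phi_toy` are ours.

```python
import numpy as np

def build_phi(k):
    """Phi = diag(A, A), with A the five-point difference matrix."""
    T = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    A = np.kron(np.eye(k), T) + np.kron(T, np.eye(k))
    return np.kron(np.eye(2), A)

def fixed_point(Phi, phi, lam, m, xi0, iters=200):
    """Iterate xi <- -lam * m^2 * Phi^{-1} phi(xi)."""
    xi = xi0.copy()
    for _ in range(iters):
        xi = -lam * m ** 2 * np.linalg.solve(Phi, phi(xi))
    return xi

k = 4
m = 1.0 / (k + 1)
n = k * k
rng = np.random.default_rng(0)
c = rng.uniform(-1.0, 1.0, n)  # hypothetical per-pixel data

def phi_toy(xi):
    """phi(xi) for the toy constraint L = f - c (Lipschitz constant 1)."""
    f = xi[:n]
    return np.concatenate([f - c, np.zeros(n)])

lam = 5.0  # well inside the range allowed by the contraction condition
xi = fixed_point(build_phi(k), phi_toy, lam, m, np.zeros(2 * n))
```

For this toy constraint the fixed point can also be obtained by solving the linear system $(A + \lambda m^2 I) f = \lambda m^2 c$ directly, which gives a simple check of the iteration.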


So,  we  have  proved  the  following  theorem. 

Theorem: 

If $0 < \lambda < 2\pi^2 \mu^{-1} [1 - \pi^2 m^2 / 24]^2$, then there
exists a unique surface minimizing $\epsilon$ (equation 16), which is
also the unique solution of equation (19), and the algorithm
described above converges to that solution.


To complete the proof we need to show that (26) and (27) hold.
Observe that $L(f, g, i, j)$ is a polynomial in $f, g$. Consequently,
$L \frac{\partial L}{\partial f}$ and $L \frac{\partial L}{\partial g}$
are also polynomials for every $i, j$. So, we can take the Lipschitz
constant $c_{ij}$ for (26) to be such that


(31) and (32) also mean that $c_{ij}$ and $d_{ij}$ exist and are
finite for every $i, j$. Finiteness is also clear since the suprema
are taken over the disc of radius 2 (a compact set) and the functions
are polynomials. In this context, if we replace $c_{ij}$ and $d_{ij}$
by their respective upper bounds of (31) and (32), it is possible
that the resulting value of $\mu$ is larger than the actual one.
However, this is immaterial for the uniqueness proof, since it only
results in a smaller range for $\lambda$.


4.2.  Summary 

Up to now we have established the fact that from the knowledge of the
surface normals at the boundaries, the constraint of the particular
problem and the smoothness condition, we can uniquely recover the
surface in view, and we have given an algorithm for this. The object
in view had to have a square boundary, but a generalization to any
kind of boundary is possible. It has to be noted, though, that there
are two important shortcomings in our theory up to now. The first has
to do with the choice of the regularization parameter $\lambda$;
$\lambda$ can be chosen from an interval of values, but the solution
we get might not be physically plausible. The second issue has to do
with the fact that we used a specific smoothness condition; in
particular, we measured departure from smoothness with the functional

$$\iint \left( f_x^2 + f_y^2 + g_x^2 + g_y^2 \right) dx\, dy$$

The choice of this functional is clearly ad hoc. The question that
arises then is: can we build our theory in such a way that we can
learn the parameters that are involved in our problem and at the same
time avoid any explicit smoothness constraints?

If we do not assume any particular variational condition for
smoothness, as long as the condition is quadratic, and if in equation
(19) we hide the regularization parameter $\lambda$ in the matrix
$\Phi$, then from the minimization we will always get an equation of
the form

$$\Phi \xi = \phi(\xi)$$

where $\xi$ is the shape vector, i.e. the surface normals in the
image, $\Phi$ is a matrix, and $\phi$ is a function of the shape that
depends on the particular problem that we are analyzing. This is a
nonlinear equation, for which we have proved that if some weak
conditions on $\phi$ are satisfied and if the matrix $\Phi$ is such
that $\|\Phi^{-1}\|_2 \, \mu < 1$, where $\mu$ is the maximum of the
Lipschitz constants of the functions $L \frac{\partial L}{\partial f}$
and $L \frac{\partial L}{\partial g}$, then the iterative formula
discussed in Section 4.1 will converge to a unique solution.


So, what we need to do is to present a way in which a neural net can
learn the matrix $\Phi$ from examples, in such a way that uniqueness,
robustness and convergence are guaranteed. This is discussed in the
next section.


5.  IMPLEMENTATION  OF 
SHAPE  COMPUTATION  AND  LEARNING 


Techniques for learning the solutions to equations have been
developed by researchers in the stochastic approximation [1, 2],
statistics [3, 7], mathematical learning theory [4, 5], computer
science [6], and electrical engineering communities [8, 9, 10]. Many
of these techniques apply to linear as well as nonlinear equations.
There are special characteristics of low-level vision problems that
may make it possible for us to accelerate the convergence of these
techniques. This is because we sometimes have a priori knowledge that
the solution must lie in a restricted subset of all possible
solutions. We may also be obliged to find the solution within a
constricted subspace, if we are to represent the solution by a neural
network of limited size with a limited number of connections, which
may not be able to represent an arbitrary function or even an
arbitrary matrix. Techniques have been developed for finding the best
solution within a restricted subspace.

First, we must discuss what exactly we want to learn and why. What
does it mean to learn a solution to

$$\Phi \xi = \phi(\xi) \qquad (33)$$

when there may not be a unique $\xi$ that satisfies (33)? What does
it mean to have learned $\Phi$ from examples when the examples might
not suffice to determine $\Phi$ or when they give contradictory
evidence as to what $\Phi$ is?

The $\Phi$ we choose is the least squares solution. Under reasonable
and quite general statistical assumptions, the least squares solution
is the most probable value for $\Phi$. Computing the least squares
solution involves calculating the Moore-Penrose inverse. In the
following sections, we describe a recursive procedure for computing
this inverse. This procedure is optimal in the sense that it produces
as accurate an estimate of $\Phi$ as we can obtain from our limited
set of examples, but it requires our neural net to store large
matrices and make very complex calculations. So instead of searching
for the optimal solution of (33) within a full $n^2$-dimensional
vector space ($n$ is the dimension of the shape vector), we calculate
the best solution within a restricted subspace. The procedure for
calculating this restricted subspace solution is very similar to that
for calculating the full space solution, but the computational
complexity is much less and the speed of convergence is much greater.
Another technique we discuss is the Robbins-Monro procedure. This
results in slower convergence, but the calculations are very easy to
implement on a neural net. Whatever procedure we use, it may take too
many examples to learn $\Phi$. We can use a priori knowledge to
accelerate convergence. For example, we might know that $\Phi$ varies
smoothly as a function of position. The $\Phi$ we use is partly
determined by examples and partly by a priori knowledge. As we
acquire more examples, the influence of a priori knowledge on our
estimate of $\Phi$ decreases.

Before discussing these computations, let us review what learning
(rather than stipulating a priori) the value of $\Phi$ will
accomplish for us. $\Phi$ hides the regularization parameter
$\lambda$. This parameter might be allowed to vary from place to
place. (We might require a different amount of smoothing near the
boundary than at the center of the visual field.) Our smoothing
condition might involve first, second, or higher-order derivatives or
some linear combination of derivatives of different orders. By
varying $\Phi$, we can take into account all these possibilities. We
do not know which of these possibilities is correct.

That is why we want a general learning theory to help us find the
proper $\Phi$ (to find the least squares solution to (33)).


5.1.  The  Generalized  Inverse 

As far as finding $\Phi$ is concerned, it is irrelevant that
$\phi(\xi)$ is a function of $\xi$. We just treat (33) as a linear
equation to be solved for $\Phi$. We are given a set of learning
examples $(\xi_i, \phi(\xi_i))$ and we want to solve the system of
equations

$$\Phi \xi_i = \phi(\xi_i) \qquad (34)$$

Even if a teacher tells us the exact value of the shape vector,
$\xi_i$, and of the image data that determine $\phi(\xi_i)$, we
cannot expect (34) to be a strict equality because the model we used
to determine $\phi$ was an oversimplification. We cannot possibly
take all factors into account. If the error is a zero-mean, normally
distributed variable, it makes sense to seek a solution in the sense
of least squares:

$$\text{minimize} \; \sum_i \|\Phi \xi_i - \phi(\xi_i)\|_2^2 \qquad (35)$$

Assuming normality of the error, the least-squares solution to (34)
is the most likely value for $\Phi$. Even if the error
$\Phi \xi_i - \phi(\xi_i)$ is not Gaussian, we can under reasonable
assumptions obtain a central limit theorem which says that if there
is a large enough number of examples the least squares solution will
be the best.

In case there are many least-squares solutions, choose the $\Phi$ of
minimum norm, $\|\Phi\|_2$. This will help to make $\Phi$ sparse. The
minimum-norm, least squares solution to $AX = I$ is called the
Moore-Penrose pseudo-inverse, $A^+$.

This inverse shares many properties with the ordinary inverse, for
example:

$A^+ A$ and $A A^+$ are symmetric,

$A^+ A A^+ = A^+$; $A A^+ A = A$.
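These properties are easy to check numerically. The sketch below uses NumPy's `numpy.linalg.pinv` on a random full-column-rank matrix; the helper name `penrose_conditions` and the test matrix are our own choices for illustration.

```python
import numpy as np

def penrose_conditions(A, A_plus, tol=1e-10):
    """Return whether each of the four properties listed above holds."""
    return {
        "A+ A symmetric": np.allclose(A_plus @ A, (A_plus @ A).T, atol=tol),
        "A A+ symmetric": np.allclose(A @ A_plus, (A @ A_plus).T, atol=tol),
        "A+ A A+ = A+": np.allclose(A_plus @ A @ A_plus, A_plus, atol=tol),
        "A A+ A = A": np.allclose(A @ A_plus @ A, A, atol=tol),
    }

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall matrix, full column rank
A_plus = np.linalg.pinv(A)        # Moore-Penrose pseudo-inverse, 3 x 5
checks = penrose_conditions(A, A_plus)
```

For a matrix with full column rank, such as this one, $A^+ A$ is additionally the identity, while $A A^+$ is only a projection.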

This inverse can be used to solve linear equations. The least squares
solution to $AXB = C$ is
$X = A^+ C B^+ + M - A^+ A M B B^+$, where $M$ is an arbitrary matrix
of appropriate dimensions [7]. Other generalized inverses have been
defined [12], but if $H^+ H = I$, then $H^+ Z$ is the only least
squares solution to $HX = Z$. $H^+ H = I$ if and only if zero is the
only null vector of $H$.

Let us apply this insight to (34). First note that if
$\Phi_1, \ldots, \Phi_n$ represent the rows of $\Phi$, (34) is really
a set of independent equations:

$$\text{for all } i, \quad \Phi_k \xi_i = \phi_k(\xi_i) \qquad (36)$$

where $\phi_k(\xi_i)$ is the $k$th element of the vector
$\phi(\xi_i)$. For convenience, we can write this as

$$\xi_i^T \Phi_k^T = \phi_k(\xi_i).$$

This has a unique least squares solution if the rank of the matrix of
examples $(\xi_1, \ldots, \xi_s)$ is at least $n$. So there should be
at least $n$ independent examples.
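In batch form, the row-wise problem (36) can be solved for all rows at once. The following sketch assumes the examples are stacked as rows of a matrix $H$ (row $i$ is $\xi_i^T$) and the data terms as rows of $Y$ (row $i$ is $\phi(\xi_i)^T$), so that $H \Phi^T = Y$; the function name `learn_phi` and this layout are our assumptions.

```python
import numpy as np

def learn_phi(examples, targets):
    """Minimum-norm least-squares estimate of Phi from examples.

    examples : s x n array, row i is the shape vector xi_i.
    targets  : s x n array, row i is phi(xi_i).
    """
    H = np.asarray(examples, dtype=float)
    Y = np.asarray(targets, dtype=float)
    # H Phi^T = Y  =>  Phi^T = H^+ Y
    return (np.linalg.pinv(H) @ Y).T

rng = np.random.default_rng(0)
n, s = 3, 6
Phi_true = rng.standard_normal((n, n))   # hypothetical "true" matrix
H = rng.standard_normal((s, n))          # s >= n independent examples
Y = H @ Phi_true.T                       # noise-free data terms
Phi_hat = learn_phi(H, Y)
```

With at least $n$ independent, noise-free examples the estimate reproduces the generating matrix; with noisy examples it is the least-squares fit.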

There are other reasons we will need at least $n$ examples. We want
our learning procedure to be robust. However, $A_j \to A$ does not
imply $A_j^+ \to A^+$ unless $\lim \mathrm{rank}(A_j) = \mathrm{rank}(A)$
as $j$ approaches infinity. We can guarantee this as long as we know
we are only dealing with matrices of rank $n$. There are many methods
for calculating generalized inverses. They all implicitly or
explicitly make a decision about the rank of the matrix being
inverted. Furthermore, we should remember we do not have an intrinsic
need to know $\Phi$. We want to be able to learn $\Phi$ so that,
given new image data, we can iteratively solve the equation

$$\xi = \Phi^+ \phi(\xi) \qquad (37)$$

for $\xi$. We learn $\Phi$ in order to obtain an accurate value for
$\xi$. Temporarily ignore the dependence of $\phi(\xi)$ on $\xi$ and
pretend the image data (or a teacher) told us the value of
$\phi(\xi)$. We want an estimate of $\xi$ that is linear in
$\phi(\xi)$, that is unbiased, and that has minimum variance (a Best
Linear Unbiased Estimate or BLUE). Whether (37) is a BLUE for $\xi$
depends on the error covariance matrix
$E\left( (\Phi_k \xi_i - \phi_k(\xi_i)) \cdot (\Phi_l \xi_j - \phi_l(\xi_j)) \right)$,
where $E$ stands for expected value. However, if there are at least
$n$ independent examples, we know that (37) is the statistically
correct BLUE estimate.

5.2.  Computing  the  Pseudo  Inverse 

Many techniques exist for computing pseudo-inverses; not all of them
are suited to the case where the data arrive in a stream (a temporal
succession of examples). One method that is suitable is Greville's
[7, 12]. The expression we want to compute is of the form

$$\Phi_{k,s}^T = H_s^+ \phi_k(H_s) \qquad (38)$$

where $\Phi_{k,s}$ is our guess of row $\Phi_k$ after having seen $s$
examples. $H_s$ is the matrix formed from the first $s$ examples of
shape vectors $(\xi_1, \ldots, \xi_s)$ and $\phi_k(H_s)$ is the
vector

$$[\phi_k(\xi_1) \;\; \phi_k(\xi_2) \;\; \cdots \;\; \phi_k(\xi_s)]^T$$

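Initialize by using the estimate $\Phi_{k,0}^T = 0$. Then we can use
the recurrence equations

$$\Phi_{k,s+1}^T = \Phi_{k,s}^T + K_{s+1} \cdot \text{error} \qquad (39)$$

where error is the error

$$\phi_k(\xi_{s+1}) - \xi_{s+1} \Phi_{k,s}^T \qquad (40)$$

obtained by using the $s$th estimate, $\Phi_{k,s}$, on the $(s+1)$th
example. The new estimate is obtained by adding to the old estimate a
term proportional to the error. The gain matrix, $K_{s+1}$, is given
by:

$$K_{s+1} = \frac{(I - H_s^+ H_s)\, \xi_{s+1}^T}{\xi_{s+1} (I - H_s^+ H_s)\, \xi_{s+1}^T} \qquad (41)$$

if $(I - H_s^+ H_s)\, \xi_{s+1}^T \neq 0$, and

$$K_{s+1} = \frac{(H_s^T H_s)^+ \xi_{s+1}^T}{1 + \xi_{s+1} (H_s^T H_s)^+ \xi_{s+1}^T} \qquad (42)$$

otherwise.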


(41) and (42) give the correct least squares solution, but depending
on the hardware one has available, computing these equations on one's
neural net may be a problem if one has to store the entire matrix of
examples, $H_s$. The shape vectors are each of size $n$ and one may
have to store $s$ of these; $s$ might be much larger than $n$, and
$n$ itself can be large if one's grid is very fine and one has to
store two parameters, $f$ and $g$, for each point on the grid. There
are several ways to deal with this problem. One is to recursively
compute two auxiliary matrices, $A_s = I - H_s^+ H_s$ and
$B_s = (H_s^T H_s)^+$. The matrix $K_{s+1}$ can be written in terms
of $A_s$ and $B_s$. These matrices are only of size $n$ by $n$.


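The recursive computations are [7]:

$$A_{s+1} = A_s - \frac{(A_s \xi_{s+1}^T)(A_s \xi_{s+1}^T)^T}{\xi_{s+1} A_s \xi_{s+1}^T} \qquad (43)$$

if $\xi_{s+1}$ is independent of the previous examples
$\xi_1, \ldots, \xi_s$, and

$$A_{s+1} = A_s$$

otherwise. Similarly, we get equation (44) below:

$$B_{s+1} = B_s - \frac{(B_s \xi_{s+1}^T)(A_s \xi_{s+1}^T)^T + (A_s \xi_{s+1}^T)(B_s \xi_{s+1}^T)^T}{\xi_{s+1} A_s \xi_{s+1}^T} + \frac{1 + \xi_{s+1} B_s \xi_{s+1}^T}{(\xi_{s+1} A_s \xi_{s+1}^T)^2} (A_s \xi_{s+1}^T)(A_s \xi_{s+1}^T)^T$$

if $\xi_{s+1}$ is independent of the previous examples and

$$B_{s+1} = B_s - \frac{(B_s \xi_{s+1}^T)(B_s \xi_{s+1}^T)^T}{1 + \xi_{s+1} B_s \xi_{s+1}^T}$$

otherwise.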
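The recursion can be sketched as follows. This is a minimal illustration of a Greville/Albert-style recursive least-squares step for a single row of $\Phi$, written from the standard form of these updates in the generalized-inverse literature; the function name `recursive_step`, the tolerance used to decide linear independence, and the column-vector convention for the example `h` are our assumptions.

```python
import numpy as np

def recursive_step(x, A, B, h, y, tol=1e-10):
    """Process one example (h, y); return updated (x, A, B).

    x : current estimate of the row (length n)
    A : I - H^+ H for the examples seen so far (n x n)
    B : (H^T H)^+ for the examples seen so far (n x n)
    h : new example (length n); y : its scalar data term
    """
    Ah, Bh = A @ h, B @ h
    hAh = h @ Ah
    if hAh > tol * (h @ h):
        # h is independent of the previous examples.
        K = Ah / hAh
        B_new = (B - (np.outer(Bh, Ah) + np.outer(Ah, Bh)) / hAh
                 + (1.0 + h @ Bh) / hAh ** 2 * np.outer(Ah, Ah))
        A_new = A - np.outer(Ah, Ah) / hAh
    else:
        # h is a linear combination of the previous examples.
        denom = 1.0 + h @ Bh
        K = Bh / denom
        A_new = A
        B_new = B - np.outer(Bh, Bh) / denom
    x_new = x + K * (y - h @ x)  # old estimate plus gain times error
    return x_new, A_new, B_new

rng = np.random.default_rng(0)
n, s = 3, 8
H = rng.standard_normal((s, n))   # hypothetical example shape vectors
y = rng.standard_normal(s)        # hypothetical data terms
x, A, B = np.zeros(n), np.eye(n), np.zeros((n, n))
for h_i, y_i in zip(H, y):
    x, A, B = recursive_step(x, A, B, h_i, y_i)
```

After all examples have been processed, `x` should agree with the batch solution `np.linalg.pinv(H) @ y`, while only the two $n \times n$ matrices are stored.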


We may still have an implementation problem because the two matrices
$A_s$ and $B_s$ are of size $n$ by $n$, which is fairly large, and
the recursive computations (43) and (44) are fairly complex. When we
have available to us an arbitrary number of connections between an
arbitrary number of nodes, there is no problem in computing these
recursions in parallel in a small amount of time, but our resources
may be limited. It is true that $\Phi$ is of size $n$ by $n$ and we
do have to store $\Phi$. But we usually have a priori knowledge that
$\Phi$ is sparse. We can take advantage of this fact to simplify our
computation.

We may not want to do that. One thing that does help us is that
$K_s$ does not depend on the data terms $\phi(\xi_s)$. So $K_s$ does
not depend on which particular "shape from" method we are using. If
we know in advance what the set of example shape vectors
$(\xi_1, \ldots, \xi_s)$ will be, we can precompute the $K_s$, and we
might be able to choose our example shape vectors to simplify this
calculation. Still, storing $K_s$ might be a problem.

We can also restrict the solution space of possible values of $\Phi$
to those satisfying some additional constraint

$$G_k \Phi_k^T = U_k \qquad (45)$$

for all $k$. Or more generally

$$\sum_k G_k \Phi_k^T = U \qquad (46)$$


rank $D$ and we confine our attention to a $D$-dimensional subspace.
The matrices $\bar{A}_s$ and $\bar{B}_s$ are only of size $D^2$.

Another approach is to change our optimality criterion (4). Instead
of learning the least squares solution to the whole system, we learn
the $\Phi$ that produces the least error on the last $r$ examples.
Then we do not need to compute $H_s^+ H_s$, but only the matrix
$H_{s,r}^+ H_{s,r}$ where
$H_{s,r}^T = (\xi_{s+1-r}, \ldots, \xi_{s+1})$. In the limit as
$s \to \infty$, $r > 1$, the solution using this criterion will
converge if we assume stationarity (the same $\Phi$ is valid for all
times). Thus, there exists a $\Phi$ such that the model
$\Phi \xi = \phi(\xi)$ plus noise holds for all examples, where the
noise is independent of $\xi$ and the noise for the different
examples is a set of statistically independent, identically
distributed random variables. If the model is not stationary, the
correct $\Phi$ changes in order to adapt to change in the world.

In the case $r = 1$, we will obtain the matrix
$\xi_{s+1} (\xi_{s+1}^T \xi_{s+1})^{-1}$. To insure convergence, we
change this to $\frac{1}{s+1} \xi_{s+1} (\xi_{s+1}^T \xi_{s+1})^{-1}$.

5.3.  Suboptimal  Gain  Sequences 


where $G_k$, $U_k$, and $U$ depend on the constraint we want to
impose. Learning within a constricted space of possible solutions is
a case intermediate between the situation where we arbitrarily choose
a particular smoothness measure in an entirely ad hoc manner and the
situation where we search the whole space of possible $\Phi$'s. For
simplicity, we will only deal with the constraint (45); (46) can be
handled in a similar manner. Then our recursion takes the form


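$$\Phi_{k,0}^T = G_k^+ U_k \qquad (47)$$

$$\Phi_{k,s+1}^T = \Phi_{k,s}^T + \bar{K}_{s+1} (\text{error}) \qquad (48)$$

where

$$\text{error} = \left( \bar{\phi}_k(\xi_{s+1}) - \bar{\xi}_{s+1} \Phi_{k,s}^T \right) \qquad (49)$$

where

$$\bar{\phi}_k(\xi_{s+1}) = \phi_k(\xi_{s+1}) - \xi_{s+1} G_k^+ U_k \qquad (50)$$

and

$$\bar{\xi}_{s+1} = \xi_{s+1} (I - G_k^+ G_k)^T. \qquad (51)$$

The gain is

$$\bar{K}_{s+1} = \frac{\bar{A}_s \bar{\xi}_{s+1}^T}{\bar{\xi}_{s+1} \bar{A}_s \bar{\xi}_{s+1}^T}$$

if $\bar{\xi}_{s+1}$ is not a linear combination of
$\bar{\xi}_1, \ldots, \bar{\xi}_s$, and

$$\bar{K}_{s+1} = \frac{\bar{B}_s \bar{\xi}_{s+1}^T}{1 + \bar{\xi}_{s+1} \bar{B}_s \bar{\xi}_{s+1}^T}$$

otherwise, where $\bar{A}_s = I - \bar{H}_s^+ \bar{H}_s$,
$\bar{B}_s = (\bar{H}_s^T \bar{H}_s)^+$ and
$\bar{H}_s = H_s (I - G_k^+ G_k)$.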

The recursions for $\bar{A}_s$ and $\bar{B}_s$ are the same as those
for $A_s$ and $B_s$ except that $\xi_s$ is replaced by $\bar{\xi}_s$.
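The effect of the barred quantities can be seen in the batch (non-recursive) form of the constrained problem. The sketch below is our own illustration for one row: it minimizes $\|Hx - y\|_2$ subject to $Gx = u$ by starting from the particular solution $G^+ u$ and correcting only within the null space of $G$. The function name and the test data are hypothetical, and the constraint is assumed to be consistent.

```python
import numpy as np

def constrained_least_squares(H, y, G, u, rcond=1e-10):
    """Minimize ||H x - y||_2 subject to G x = u (assumed consistent)."""
    G_plus = np.linalg.pinv(G)
    x0 = G_plus @ u                        # particular solution of G x = u
    P = np.eye(H.shape[1]) - G_plus @ G    # projector onto the null space of G
    H_bar = H @ P                          # examples restricted to that subspace
    # The correction lies in the range of P, so G x = u is preserved.
    return x0 + np.linalg.pinv(H_bar, rcond=rcond) @ (y - H @ x0)

rng = np.random.default_rng(0)
H = rng.standard_normal((10, 4))   # hypothetical examples
y = rng.standard_normal(10)        # hypothetical data terms
G = rng.standard_normal((2, 4))    # hypothetical constraint matrix
u = rng.standard_normal(2)
x = constrained_least_squares(H, y, G, u)
```

Because the search is confined to the null space of $G$, the effective dimension of the problem drops from $n$ to $D = n - \mathrm{rank}(G)$.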

The equations (47)-(57) are the same as those in the unconstrained
case except for the bars, which restrict our attention to the
subspace such that $G_k \Phi_k^T = U_k$. If the null space of $G_k$
has dimension $D$, then $I - G_k^+ G_k$ has

The methods we have discussed so far for computing least squares
solutions involve computation of complex expressions in order to
determine the matrices $K_s$. We can, however, work with much simpler
matrices. For example, we can assume the matrix $K_s$ is of the form

$$K_s = \alpha_s \xi_s^T \qquad (52)$$

The $\alpha_s$ must satisfy the conditions

$$\alpha_s > 0, \qquad \sum_s \alpha_s = \infty, \qquad \sum_s \alpha_s^2 < \infty \qquad (53)$$

The condition $\sum_s \alpha_s = \infty$ insures that the later
examples in the sequence $\xi_s$ have sufficient influence on the
matrix learned. The other condition causes convergence.

In this case, we have convergence with probability one to the correct
least squares solution. The convergence rate is slower than optimal.
This method for finding the solution to equations was originally
developed by Robbins and Monro [13]. One can show that if one chooses
$\alpha_s = 1/s$, then the convergence rate is asymptotically
$O(1/s)$ [14].
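A stochastic-approximation update of this kind needs only the current example and the current estimate. The sketch below is illustrative rather than the authors' procedure: it uses the gain sequence $\alpha_s = c/(s + c)$, which is our choice (it satisfies the conditions (53) and keeps every step below one), normalizes the example vectors to unit length so that each noise-free step cannot increase the error, and updates all rows of $\Phi$ at once.

```python
import numpy as np

def robbins_monro(examples, targets, c=3.0):
    """Learn Phi from a stream of (xi, phi(xi)) pairs with scalar gains."""
    n_out, n_in = targets.shape[1], examples.shape[1]
    Phi = np.zeros((n_out, n_in))
    for s, (xi, t) in enumerate(zip(examples, targets), start=1):
        alpha = c / (s + c)                  # assumed gain sequence
        error = t - Phi @ xi                 # prediction error on this example
        Phi += alpha * np.outer(error, xi)   # row k moves by alpha * error_k * xi^T
    return Phi

rng = np.random.default_rng(1)
n = 3
Phi_true = rng.standard_normal((n, n))       # hypothetical generating matrix
X = rng.standard_normal((2000, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length examples
T = X @ Phi_true.T                             # noise-free data terms
Phi_hat = robbins_monro(X, T)
```

Compared with the recursive pseudo-inverse, no $n \times n$ auxiliary matrices are stored, at the price of needing many more examples to reach the same accuracy.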

The Robbins-Monro procedure can be shown to converge with probability
one to the least-squares solution to $\Phi \xi = \phi(\xi)$ under
assumptions much weaker than the requirement that the errors be
Gaussian. For example, we can require only that the errors have
identical, independent distributions and then we can obtain a
convergence proof. Or we can allow the errors to be dependent on the
state $\Phi$, because we really do not have the equation
$\Phi \xi = \phi(\xi) + \text{error}$. Rather we have

$$\Phi(\xi + \text{error}_1) = \phi(\xi + \text{error}_1) + \text{error}_2 + \text{error}_3 \qquad (54)$$

where $\text{error}_1$ takes into account the fact that we can never
know $\xi$ with total precision. It also allows for the fact that
$\Phi \xi = \phi(\xi)$ is usually obtained by maximum likelihood
reasoning. The $\xi$ we find is the one that best fits (that is most
likely) given the data, but occasionally improbable events can occur.
One cannot expect the actual $\xi$ to be exactly equal to the most
probable $\xi$. $\text{Error}_2$ arises because we do not know the
data and hence we do not know $\phi$ with total precision.
$\text{Error}_3$ is due to the factors our model fails to take into
account. As long as we satisfy some simple conditions, we still get
almost sure convergence to the least-squares solution. We can simply
require that the errors in the different examples be bounded, almost
surely have finite variance, be identically distributed and
independent. These conditions are still somewhat restrictive. There
are extensions to cases where the errors are a moving average of
exogenous variables. For more complete discussions of when the
Robbins-Monro procedure converges, see [1, 10, 5]. Note that the
important assumption of bounded errors is reasonable if the
derivatives of $\phi$ are bounded.


Even if we have convergence, we need not have convergence to the
correct solution if our sample examples are not representative of the
naturally occurring shapes. The solution will instead converge to the
least-squares solution to the artificial problem presented by the
atypical examples. What we require is that the vector spaces of
naturally occurring possible examples be spanned infinitely often.
Thus, for every $m$, the rank of the matrix
$(\xi_m, \xi_{m+1}, \ldots)$ must be $n$. And we also require that we
can choose a function $f(i)$ such that for all small $i$, the shape
vectors $\xi_{f(i)}, \ldots, \xi_{f(i+1)}$ span the vector space and
$\sum_i \frac{1}{f(i)} = \infty$.


5.4.  Acceleration  of  Convergence

Even if we use the optimal formula for estimation of the least
squares solution, convergence can be quite slow. The variance of our
estimates depends on the trace of the gain matrix. Thus, our rate of
learning is inversely proportional to a large number $n$.

The result of this slow convergence is that our estimate $\Phi$ may
not be as sparse as it should be. Another possibility is that some of
the values of $\Phi = (\Phi_{ij})$ might be misleading. We might have
a reasonably accurate estimate of each $\Phi_{ij}$ but a bad estimate
of the high frequency components such as
$\Phi_{i,j+1} - \Phi_{i,j}$. The estimate of these components might
simply reflect noise. This may mean that although we can compute
shape accurately, we cannot accurately compute the difference in
orientation of two nearby points.

In other words, we still have to deal with the regularization
problem. We might make the ad hoc assumption that
$(\partial \Phi / \partial i)^2 + (\partial \Phi / \partial j)^2$ is
small except at those $ij$ that reflect the information that data at
a given point $P$ gives us information about shape in the immediate
vicinity of $P$. So, regarding $\Phi$ as a continuous function, we
will have a condition of the form

$$\text{minimize} \; \sum_{i,j} (\Phi_{ij} - \hat{\Phi}_{ij})^2 + \lambda \iint_Q \left[ \left( \frac{\partial \Phi}{\partial i} \right)^2 + \left( \frac{\partial \Phi}{\partial j} \right)^2 \right] \qquad (55)$$

where $\hat{\Phi}_{ij}$ is the value given by the recursive
least-squares estimation of $\Phi$. The ad-hocness here is not as bad
as the ad-hocness in assuming that shape is smooth. Here we are only
assuming that the function relating the shape to the data is smooth
almost everywhere (it should be smooth in the region $Q$, which
excludes the values of $\Phi_{ij}$ corresponding to information that
the data at point $P$ provide about the shape at point $P$).
Furthermore, as we learn more and more examples, we can let
$\lambda \to 0$. So, the influence of the ad hoc smoothing condition
lessens. Of course, we could have chosen some other smoothing
condition. As long as this other condition is quadratic, we will
obtain a linear constraint in $\Phi$. In general we will have a
stabilizing family of smoothing conditions:

$$\text{minimize} \; \sum_{i,j} (\Phi_{ij} - \hat{\Phi}_{ij})^2 + L_s(\Phi, \Phi) \qquad (56)$$

where the $L_s$ are bilinear forms and $L_s \to 0$ (see Morozov
[16]).

Do we want to learn $L_s$? That is, do we want to learn the linear
condition that (56) gives rise to? The set of linear transformations
from $R^n$ to $R^n$ has dimension $n^2$ by $n^2$. We do not really
want to learn an $n^2$ by $n^2$ matrix, unless we can a priori
constrain the matrix (the matrix in the linear condition
corresponding to (56)) to a small subspace. The only reason we needed
the additional constraint (56) is that we need a large number of
examples to accurately learn the value of $\Phi$, particularly if the
data are noisy, which will be true if the examples are given by
nature. We have to make temporary ad hoc assumptions somewhere. It is
not clear where we should make them. One thing, though, is clear. We
should not make ad hoc assumptions of smoothness of shape. We can
eventually learn the right variational condition. It is safer to make
the assumption in (56) because eventually $L_s \to 0$ as we
accumulate more examples.

The other way to handle slow learning of $\Phi$ is to restrict the
subspace of possible solutions. We can still apply the same basic
Robbins-Monro procedure, except that when computing the error we have
to use a formula like (49), and we project $\Phi_{s+1}$ into the
restricted subspace of possible solutions. We can also restrict the
solution to a closed convex subset and impose inequality constraints
similar to the one that guarantees convergence, namely

(57) 

5.5.  Varying  Parameters  to  Obtain  Unique  Con¬ 
vergence 

After we have learned $\Phi$, our task is not necessarily done. If
$\|\Phi^+\|_2 \, \mu < 1$ (58), where $\mu$ is the maximum of the
Lipschitz constants, we have a unique fixed point to
$\xi = \Phi^+ \phi(\xi)$, which we compute iteratively. If
$\|\Phi^+\|_2 \, \mu > 1$, it is not clear how to proceed, because
convergence is not guaranteed. Of course, it may be the case that
there is no unique fixed point. Another possibility is that
$\phi(\xi)$ is the wrong function to use and that a mathematically
equivalent formulation will satisfy (58).
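Checking condition (58) and running the fixed-point iteration takes only a few lines once $\Phi$ is available. The sketch below is a toy illustration: the matrix `Phi`, the function `phi_toy` (an elementwise sine plus a constant, whose Lipschitz constant is 1), and the helper names are all hypothetical stand-ins, not quantities from any particular shape-from-X problem.

```python
import numpy as np

def contraction_factor(Phi, mu):
    """Return ||Phi^+||_2 * mu; a value below 1 satisfies condition (58)."""
    return np.linalg.norm(np.linalg.pinv(Phi), 2) * mu

def iterate_shape(Phi, phi, xi0, iters=100):
    """Iterate xi <- Phi^+ phi(xi) starting from xi0."""
    Phi_plus = np.linalg.pinv(Phi)
    xi = np.asarray(xi0, dtype=float)
    for _ in range(iters):
        xi = Phi_plus @ phi(xi)
    return xi

Phi = np.array([[4.0, 1.0], [1.0, 3.0]])   # hypothetical learned matrix
b = np.array([0.5, -1.0])                  # hypothetical data-dependent offset

def phi_toy(xi):
    """Toy nonlinearity with Lipschitz constant mu = 1."""
    return np.sin(xi) + b

q = contraction_factor(Phi, mu=1.0)
xi_star = iterate_shape(Phi, phi_toy, np.zeros(2))
```

When the factor `q` is below one, the iterate converges geometrically and `xi_star` satisfies $\Phi \xi = \phi(\xi)$ to within rounding error.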


For example, if $\phi(\xi) = \Phi \xi$ we could use instead the
constraint

$$(\Phi + I) \xi = \phi(\xi) + \xi$$

Or we may make a nonlinear transformation. For example, in one
dimension a constraint of the form
$x^4 + ax^3 + bx^2 + cx + d = x$ could be written as
$x = (x - ax^3 - bx^2 - cx - d)^{1/4}$. There is an infinite
dimensional space of possible ways of rewriting the constraint
$\Phi \xi = \phi(\xi)$. It is not clear how to systematically find a
reformulation for which (58) is true. The problem is that (58) must
hold regardless of the data.

One way to get around this problem is to let $\Phi$ be a
function of the data. Letting $\Phi$ be an arbitrary function of
the data will not work, because there is then an easy trivial
solution to $\Phi\xi = \phi(\xi)$. What we might do is divide
the space of possible data into a small number of regions $R_i$
and then for each region learn the least-squares solution
$\Phi_i$ to the linear equation

$$\Phi_i\,\xi = \phi(\xi)$$

when the data is in $R_i$. The more regions $R_i$ we have, the
longer it takes to learn all the $\Phi_i$. So, we might stop
subdividing $R_i$ as soon as we find that
$\|\Phi_i^{+}\|_2\,\mu < 1$, where the Lipschitz constant $\mu$
is only evaluated for $\xi$ corresponding to data in $R_i$. We
can, as before, impose a regularization condition requiring
smoothness across boundaries. Future research is needed on how
to let $\Phi$ vary with the data (what regions to choose and how
to find mathematically equivalent formulations of (33)).

6.  CONCLUSIONS 

Many low-level vision problems have a common characteristic. We
are given image data that do not uniquely determine the physical
parameters of the object in view (as such parameters we chose
shape in this paper). This problem is usually addressed by
adding smoothness constraints that force the solution to be
unique. We have shown how any quadratic smoothness constraint
can be learned from examples by using well known mathematical
techniques such as Greville's method for recursive computation
of a least squares solution and the Robbins-Monro technique of
stochastic approximation. The Robbins-Monro technique works
under a variety of noise conditions and is easily implementable
in massively parallel networks. This technique provides a
uniform theory for learning smoothness constraints in low-level
vision. The convergence can be accelerated by using a priori
knowledge of the relationship between image data and physical
properties, but the ultimate answer provided by the system does
not depend on these assumptions. Once the connectionist network
has learned all the parameters involved in the computation of
real world properties from image data, it can use a simple
iterative technique based on the theory of fixed points to find
a unique solution to problems such as "shape from X". This
fixed point method extends the usual linear regularization
theory and shows how one can uniquely solve a non-linear
variational problem provided the solution at the boundaries is
known. We are currently experimenting with our theory, using
synthetic and real images. The details of our connectionist
implementation, which is based on ideas from [38], and our
experimental results will appear elsewhere.

REFERENCES 

1: H. Kushner and D. Clark, Stochastic approximation
methods for constrained and unconstrained systems,
Springer Verlag, 1978.

2: M.T. Wasan, Stochastic Approximation, Cambridge,
1969.

3: L. Broemeling, Bayesian analysis of linear models,
Marcel Dekker, 1985.

4: Y.Z. Tsypkin, Foundation of the Theory of Learning
Systems, Academic Press, 1973.

5: U. Herkenrath, D. Kalin and W. Vogel (Eds.),
Mathematical Learning Models: Theory and Applications,
Springer, 1983.

6: T. Kohonen, Associative Memory: A Systems Theoretical
Approach, Springer, 1977.

7: A. Albert, Regression and the Moore-Penrose Inverse,
Academic Press, 1972.

8: R.E. Kalman, "A new approach to linear filtering and
prediction problems", J. Basic Eng. Trans. ASME
series D, 82, 35-45.

9: L. Ljung, "Analysis of recursive stochastic algorithms",
IEEE Trans. Aut. Control, AC-22, 551-75, 1977.

10: P. Maybeck, Stochastic Models, Estimation and Control,
Academic Press, 1979.

11: R. Penrose, "A generalized inverse for matrices",
Proc. Cambridge Phil. Soc. 51, 406-13, 1955.

12: A. Ben-Israel and T. Greville, Generalized Inverses,
Theory and Applications, Wiley, 1974.

13: H. Robbins and S. Monro, "A stochastic approximation
method", Ann. Math. Stat. 22, 1951.

14: K.L. Chung, "On a stochastic approximation
method", Ann. Math. Stat. 25, 1954.

15: A.E. Albert and L.A. Gardner, Stochastic Approximation
and Nonlinear Regression, MIT Press, 1967.

16: V.A. Morozov, Methods for Solving Incorrectly Posed
Problems, Springer, 1984.

16: B.K.P. Horn, Robot Vision, McGraw-Hill, 1986.

17: A. Witkin, "Recovering surface orientation and shape
from texture", Artificial Intelligence, 17, 1981.

18: J. Aloimonos, "Shape from texture", Proc. IEEE
CVPR, 1986.

19: K. Stevens, "Shape from texture and contour", Ph.D.
thesis, MIT, 1981.

20: S. Ullman, "The interpretation of structure from
motion", Proc. R. Soc. Lond. B 203, 405-426, 1979.




21: T. Kanade, "Determining the shape of an object from
a single view", Artificial Intelligence, 17, 1981.

22: J. Aloimonos and A. Basu, "Shape and motion from
contour without correspondence", Proc. IEEE CVPR,
1986.

23: D. Marr, Vision, McGraw Hill, 1982.

24: A.N. Tichonov and V.Y. Arsenin, Solution of
Ill-Posed Problems, Winston and Wiley Publishers,
Washington D.C., 1977.

25: T. Poggio and C. Koch, "Ill-posed problems in early
vision: from computational theory to analog networks",
Proc. Royal Society of London B, 1985.

26: S. Shafer, personal communication, 1986.

27: J. Aloimonos, "Computing intrinsic images", Ph.D.
thesis, University of Rochester, 1986.

28: T. Poggio and V. Torre, "Ill-posed problems and
regularization in early vision", Artificial Intelligence Lab
Memo no. 773, MIT, Cambridge, Massachusetts.

29: D. Lee, "An algorithm for shape from shading", Proc.
Image Understanding Workshop, 1985.

30: G.D. Smith, Numerical Solution of Partial Differential
Equations: Finite Difference Methods, Oxford University
Press, 1978.

31: K. Ikeuchi, "Shape from regular patterns", Artificial
Intelligence, 22, 49-75, 1984.

32: J. Kender, "Shape from texture", Ph.D. thesis,
Carnegie Mellon University, 1981.

33: T. Kanade and J. Kender, "Skewed symmetry: mapping
image regularities into shape", Technical Report
CMU-CS-80-133, Dept. of Computer Science,
Carnegie-Mellon University.

34: Y. Ohta, K. Maenobu and T. Sakai, "Obtaining surface
orientation from texels under perspective projection",
Proc. 7th IJCAI, Vancouver, Canada, 1981.

35: J. Aloimonos and M. Swain, "Shape from texture",
Proc. IJCAI, Los Angeles, CA, 1985.

36: J. Hadamard, Lectures on the Cauchy Problem in
Linear Partial Differential Equations, New Haven: Yale
University Press.

37: T. Poggio, "MIT Progress in understanding images",
Proc. Image Understanding Workshop, 1985.

38: J. Feldman, "Four frames suffice: A provisional model
of vision and space", Behavioral and Brain Sciences, June
1985.




LOCAL SHAPE FROM SPECULARITY


Glenn Healey and Thomas O. Binford


Artificial Intelligence Laboratory
Computer Science Department
Stanford University
Stanford, California 94305


Abstract 

We show that highlights in images of objects with specularly
reflecting surfaces provide significant information about the
surfaces which generate them. A brief survey is given of
specular reflectance models which have been used in computer
vision and graphics. For our work, we adopt the
Torrance-Sparrow specular model which, unlike most previous
models, considers the underlying physics of specular reflection
from rough surfaces. From this model we derive powerful
relationships between the properties of a specular feature in an
image and local properties of the corresponding surface. We
show how this analysis can be used for both prediction and
interpretation in a vision system. A shape from specularity
system has been implemented to test our approach. The
performance of the system is demonstrated by careful
experiments with specularly reflecting objects.


1.  Introduction 

When light is incident on a surface, some fraction of it is
reflected. A perfectly smooth surface reflects light only in
the direction such that the angle of incidence equals the angle
of reflection. For rougher surfaces, e.g. the surface of a metal
fork, specular effects are still observable. In this paper we
analyze the properties of specular reflection from rough
surfaces.

There are numerous reasons why the study of specular reflection
deserves serious attention in computer vision. Specularities
are almost always the brightest regions in an image. Contrast
is often large across specularities; they are very prominent.
In addition, the presence or absence of specular features
provides immediate constraints on the positions of the viewer
and light sources relative to the specular surface. Also, as we
will show, the properties of a specularity constrain the local
shape and orientation of the specular surface.

An ability to understand specular features is valuable for any
vision system which must interpret images of glossy surfaces.
This work, motivated by experience with ACRONYM [?], began in
order to provide the SUCCESSOR system with the capability to
reason about specular reflection from metal parts in the ITA
project [1]. Images of these parts typically contain large
specular regions (Figure 1).


Figure 1. Typical Image Containing Specularities


We examine what information can be inferred from an image of a
rough surface by considering the physics of specular
reflection. Particular emphasis is placed on finding symbolic
quasi-invariant relationships which will hold in many different
situations (e.g. different source, viewer configurations). In
contrast to many intensity-based vision algorithms, we compute
a small number of local surface statistics based on the
properties of a relatively large number of pixels in an image.
This allows us to observe predicted features and infer local
surface shape in noisy intensity images and in cases where
available specular models do not completely characterize the
physics of specular reflection.




2.  Review  of  Previous  Work 

Researchers in computer graphics have used increasingly
realistic specular models. Several of these models will be
discussed in the next section. In computer vision, however,
relatively few attempts have been made to exploit the
information encoded in specularities. Ikeuchi [?] employs the
photometric stereo method [21] and uses distributed light
sources to determine the orientation of patches on a surface.
Grimson [11] uses Phong's specular model [18] to examine
specularities from two views in order to improve the
performance of surface interpolation. Coleman and Jain [7] use
four-source photometric stereo to identify and correct for
specular reflection components. In more recent work, Blake [2]
assumes smooth surfaces and single point specularities to
derive equations to infer surface shape using specular stereo.
He shows that the same equations can be used to predict the
appearance of a specularity on a smooth surface when using a
distributed light source. Takai, Kimura, and Sata [22] describe
a model-based vision system which recognizes objects by
predicting specular regions. As specular models and insights
improve, we expect to see more work which makes use of the
properties of specular reflection.


3.  Specular  Reflectance  Models 

Given a viewer, a surface patch, and a light source, a
reflectance model quantifies the intensity the viewer will
perceive. General reflectance models represent the perceived
intensity $I$ as a sum of two reflection components

$$I = I_D + I_S \tag{1}$$

$I_D$ represents the intensity of diffusely reflected light and
$I_S$ represents the intensity of specularly reflected light.
In this paper we restrict our attention to the $I_S$ reflection
component.

We note that it is typically easy to separate the $I_S$
reflection component from the $I_D$ reflection component in an
image. There are several distinctive properties of specular
reflection. Over most of a surface $I_S$ is zero, but in
specular regions $I_S$ is usually very large relative to $I_D$.
In regions where the specular component is nonzero, $I_S$
changes much more rapidly with surface geometry than $I_D$.
Furthermore, the color of the $I_S$ reflection component is
almost always different from the color of the $I_D$ reflection
component.

Before discussing the various specular reflectance models, we
introduce the reflection geometry (Figure 2). We consider a
viewer looking at a surface point $P$ which is illuminated by a
point light source. Define

$V$ = unit vector from $P$ in direction of viewer
$N$ = unit surface normal at $P$
$L$ = unit vector from $P$ in direction of source
$H = (V + L)/\|V + L\|$ (unit angular bisector of $V$ and $L$)
$\alpha = \cos^{-1}(N \cdot H)$ (the angle between $N$ and $H$)


Figure  2.  The  Reflection  Geometry 


In describing specular models, we consider illumination from a
single point light source. In principle, we lose no generality
using this approach. In situations involving distributed light
sources, we only need to integrate the effects of an equivalent
array of point sources. A discussion of the geometry of
extended sources is given in [14].

The simplest specular model assumes that specularities only
occur where the angle of incidence equals the angle of
reflection and $L$, $N$, and $V$ all lie in the same plane.
This corresponds to the situation $\alpha = 0$ in Figure 2.
Unless the surface is locally flat, this model predicts that
specularities will only be observed at isolated points on a
surface. A few experiments, however, show that this model is
inadequate for most real surfaces. Not only are observed
specular features usually larger than single points, but
highlights often occur in places which are not predicted by
this model.

An empirical model for specular reflection has been developed
by Phong [18] for computer graphics. This model represents the
specular component of reflection by powers of the cosine of the
angle between the perfect specular direction and the line of
sight. Thus, Phong's model is capable of predicting
specularities which extend beyond a single point. While Phong's
model gives a reasonable approximation which is useful in some
contexts, the parameters of this model have no physical
meaning. It is possible to develop more accurate models by
examining the physics underlying specular reflection.
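
A cosine-power specular term of the kind described for Phong's model can be sketched as follows. The function name `phong_specular`, the scale `k_s`, and the exponent `n` are illustrative names chosen here, not notation from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def phong_specular(N, L, V, k_s=1.0, n=20):
    """Cosine-power specular term k_s * max(0, R . V)**n.

    R is the perfect specular direction (L mirrored about N); k_s and n are
    empirical parameters with no direct physical meaning.
    """
    N, L, V = normalize(N), normalize(L), normalize(V)
    nl = dot(N, L)
    R = tuple(2.0 * nl * nc - lc for nc, lc in zip(N, L))
    return k_s * max(0.0, dot(R, V)) ** n


# Viewer exactly in the mirror direction: the term reaches its maximum k_s.
peak = phong_specular(N=(0, 0, 1), L=(1, 0, 1), V=(-1, 0, 1))
# Viewer 45 degrees away from the mirror direction: the term falls off sharply.
off_peak = phong_specular(N=(0, 0, 1), L=(1, 0, 1), V=(0, 0, 1))
```

Larger exponents narrow the highlight, which is how the model produces specularities of finite extent without referring to surface roughness.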

The Torrance-Sparrow model [23], developed by physicists, is a
more refined model of specular reflection. This model assumes
that a surface is composed of small, randomly oriented,
mirror-like facets. Only facets with a normal in the direction
of $H$ contribute to $I_S$. The model also quantifies the
shadowing and masking of facets by adjacent facets using a
geometrical attenuation factor. The resulting specular model is

$$I_S = F D A \tag{2}$$

where

$F$ = Fresnel coefficient
$D$ = facet orientation distribution function
$A$ = adjusted geometrical attenuation factor

We will analyze the effects of each factor in the model in the
next few paragraphs. The results we present in this paper are
derived from equation (2).

The Fresnel coefficient $F$ models the amount of light which is
reflected from individual facets. In general, $F$ depends on
the incidence angle and the complex index of refraction of the
reflecting material. Cook and Torrance [8] have shown that to
synthesize realistic images, $F$ must characterize the color of
the specularity. The Fresnel equations predict that $F$ is a
nearly constant function of incidence angle for the class of
materials with a large extinction coefficient [21]. This class
of materials includes all metals and many other materials with
a significant specular reflection component.

The distribution function $D$ describes the orientation of the
micro facets relative to the average surface normal $N$. Blinn
[3] and Cook and Torrance [8] discuss various distribution
functions. All of these functions are very similar in shape. In
agreement with Torrance and Sparrow we use the Gaussian
distribution function given by

$$D = K e^{-(\alpha/m)^2} \tag{3}$$

where $K$ is a normalization constant. Thus, for a given
$\alpha$, $D$ is proportional to the fraction of facets
oriented in the direction $H$. The constant $m$ indicates
surface roughness and is proportional to the standard deviation
of the Gaussian. Small values of $m$ describe smooth surfaces
for which most of the specular reflection is concentrated in a
single direction. Large values of $m$ are used to describe
rougher surfaces with larger differences in orientation between
nearby facets. These rough surfaces produce specularities which
appear spread out on the reflecting surface. Figure 3 shows the
effect of different values of $m$.
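
The effect of the roughness constant can be tabulated with a short sketch. It assumes the Gaussian form $D = K e^{-(\alpha/m)^2}$ for the facet distribution, with $\alpha$ the angle between the surface normal and the bisector of the viewer and source directions; the function names and the sample geometry are chosen here for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def half_vector(V, L):
    """Unit angular bisector H of the directions toward viewer and source."""
    V, L = normalize(V), normalize(L)
    return normalize(tuple(v + l for v, l in zip(V, L)))


def facet_angle(N, H):
    """alpha = arccos(N . H), clamped against rounding error."""
    c = dot(normalize(N), normalize(H))
    return math.acos(max(-1.0, min(1.0, c)))


def facet_distribution(alpha, m, K=1.0):
    """Assumed Gaussian facet distribution D = K * exp(-(alpha / m)**2)."""
    return K * math.exp(-((alpha / m) ** 2))


# Hypothetical geometry: viewer overhead, source at the horizon.
H = half_vector(V=(0, 0, 1), L=(1, 0, 0))
alpha = facet_angle(N=(0, 0, 1), H=H)          # 45 degrees
smooth = facet_distribution(alpha, m=0.1)      # essentially no facets face H
rough = facet_distribution(alpha, m=0.5)       # a noticeable fraction does
```

For the same geometry the rough surface still returns a measurable specular contribution while the smooth one does not, which matches the spreading described above.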


Figure 3a. Specular Distribution for Small m (incident ray and reflected rays)

Figure 3b. Specular Distribution for Large m


The factor $A$ quantifies the effects of a geometrical
attenuation factor $G$ corrected for foreshortening by dividing
by $(N \cdot V)$.

$$A = \frac{G}{N \cdot V} \tag{4}$$

$G$ is derived by Torrance and Sparrow in [23]. They assume
that each specular facet makes up one side of a symmetric
v-groove cavity. From this assumption, they examine the various
possible facet configurations which correspond to shadowing or
masking. The expression is

$$G = \min\left\{ 1,\ \frac{2 (N \cdot H)(N \cdot V)}{V \cdot H},\ \frac{2 (N \cdot H)(N \cdot L)}{V \cdot H} \right\} \tag{5}$$

We will show that in applications it is often possible to use a
simpler expression for $G$.
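
The attenuation factor and its foreshortening correction are straightforward to evaluate numerically. The sketch below assumes the three-term minimum form of $G$ (unity, a masking term in $N \cdot V$, and a shadowing term in $N \cdot L$, each divided by $V \cdot H$); the function names and the example angles are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def geometric_attenuation(N, V, L):
    """G = min(1, 2(N.H)(N.V)/(V.H), 2(N.H)(N.L)/(V.H))."""
    N, V, L = normalize(N), normalize(V), normalize(L)
    H = normalize(tuple(v + l for v, l in zip(V, L)))
    nh, nv, nl, vh = dot(N, H), dot(N, V), dot(N, L), dot(V, H)
    return min(1.0, 2.0 * nh * nv / vh, 2.0 * nh * nl / vh)


def adjusted_attenuation(N, V, L):
    """A = G / (N.V): attenuation corrected for foreshortening."""
    return geometric_attenuation(N, V, L) / dot(normalize(N), normalize(V))


N = (0.0, 0.0, 1.0)
# Viewer and source both along the normal: no shadowing or masking.
g_normal = geometric_attenuation(N, V=(0, 0, 1), L=(0, 0, 1))
# Viewer 80 degrees from the normal: the masking term takes over.
tilt = math.radians(80.0)
V_grazing = (math.sin(tilt), 0.0, math.cos(tilt))
g_grazing = geometric_attenuation(N, V_grazing, L=(0, 0, 1))
a_grazing = adjusted_attenuation(N, V_grazing, L=(0, 0, 1))
```

Away from grazing angles $G$ stays at 1, so the variation of $A$ comes mainly from the division by $N \cdot V$.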

Let $\mu$ be the angle between $N$ and $V$. As $\mu$ increases
from 0 to $\pi/2$, the viewer gradually sees a larger part of
the reflecting surface in a unit area in the view plane.
Therefore, as $\mu$ gets larger, there are correspondingly more
surface facets which contribute to the intensity perceived by
the viewer. We take this phenomenon into account in (4) by
dividing by $N \cdot V$.




4. Shape from Specularity

In this section, we demonstrate how we can use (2) to determine
local surface properties from specularities. In almost all
situations we do not require the full generality of (2) to
infer these local properties. Our first assumption is that $F$
is a constant with respect to viewing geometry. This is a very
good approximation for metals and for many other materials. We
can further simplify (2) by observing that the exponential
factor in (3) changes much faster than any of the terms of $A$.
Therefore, except for a small range of angles near grazing
incidence, $A$ can be considered constant across the
specularity. We will discuss the consequences of this
assumption later. Hence, the form of (2) used to determine
local surface properties is

$$I_S = K' e^{-(\alpha/m)^2} \tag{6}$$

where $K'$ is a constant.

Referring again to the geometry of Figure 2, we assume that the
viewer and light source are distant relative to the dimensions
of the surface. Therefore $V$ and $L$ may be regarded as
constant; hence their angular bisector $H$ is also constant. We
assume that the positions of the viewer and light source are
known. Finally, since the distance from the viewer to the
surface is large, we can approximate the perspective projection
of the imaging device with an orthographic projection.


4.1.  Inferring  Local  Surface  Shape 

For a surface $M$ on which the Gaussian curvature is locally
nonzero, we will be able to locate a single point $P_0$ of
maximum intensity in the image of the specularity. From (6) we
see that this point corresponds to the local surface
orientation $N = H$ (i.e. $\alpha = 0$). Given such a surface
where $H$ is known, we can immediately determine the surface
orientation at $P_0$.

Figure 4 shows a typical intensity surface for a specular
image. The level sets are image curves of constant specular
intensity. $P_0$ corresponds to $\alpha = 0$. As predicted by
(6), specular intensity decreases as we move away from $P_0$.

After locating $P_0$, we can transform the specular intensity
image to the $\alpha$ angle image. Consider equation (6). If
$I'$ is the specular intensity corresponding to an arbitrary
image point $P'$ near $P_0$, then the angle $\alpha$ at the
surface point imaging to $P'$ is given by

$$|\alpha| = m \sqrt{\ln\left(\frac{K'}{I'}\right)} \tag{7}$$

Figure 4. Specular Intensity Surface for a Curved Surface

We see that $\alpha$ is determined only up to sign. This will
cause the sign of the normal curvature computed at $P_0$ to be
ambiguous. In applications, this ambiguity can usually be
resolved by considering other cues. From (7) we can compute the
absolute value of $\alpha$ corresponding to each point $P'$ in
a neighborhood of $P_0$. The image of $|\alpha|$ values is
called the $\alpha$ angle image.
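
The conversion from specular intensity to angle magnitude can be sketched by inverting the Gaussian intensity model $I_S = K' e^{-(\alpha/m)^2}$. This is an illustrative sketch: the function names are made up here, the roughness `m` is assumed known, and the peak intensity of the image is used as the estimate of $K'$.

```python
import math


def alpha_magnitude(intensity, peak, m):
    """|alpha| = m * sqrt(ln(peak / intensity)) for 0 < intensity <= peak."""
    if not 0.0 < intensity <= peak:
        raise ValueError("intensity must lie in (0, peak]")
    return m * math.sqrt(math.log(peak / intensity))


def alpha_angle_image(image, m):
    """Map a 2-D list of specular intensities to |alpha| values.

    The image maximum is taken as K', i.e. the intensity at alpha = 0.
    """
    peak = max(max(row) for row in image)
    return [[alpha_magnitude(v, peak, m) for v in row] for row in image]


# Round trip on synthetic data: build intensities from known angles.
m, K = 0.2, 5.0
true_alpha = [[0.3, 0.1], [0.0, 0.2]]
image = [[K * math.exp(-((a / m) ** 2)) for a in row] for row in true_alpha]
recovered = alpha_angle_image(image, m)
```

Only the magnitude is recovered, so the sign ambiguity noted in the text remains and has to be settled by other cues.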

From the $\alpha$ angle image, we can compute local curvature
properties of the surface. Let $T_{P_0}(M)$ be the tangent
space to $M$ at $P_0$. To compute curvatures, we take a finite
number $n$ of straight line samples of the $\alpha$ angle image
intersecting $P_0$. To insure uniform angular resolution on the
surface, these samples must be taken in equally spaced
directions in $T_{P_0}(M)$. In general, equally spaced
directions in the image will not correspond to equally spaced
directions in the tangent space. Thus, given a direction in the
tangent space to the surface at $P_0$, we need to determine the
corresponding direction in the image.

Consider a 3-D coordinate system such that the viewer is
looking down the z-axis and $P_0$ is at the origin (Figure 5).

Figure 5. The Projection Geometry

At $P_0$ we have $N = H$, so that the vector $H$ is normal to
the tangent space to the surface at $P_0$. We choose to define
angles in the tangent space in the counterclockwise sense from
$y = 0$ and in the half space $x > 0$. Denote the normal to the
surface at $P_0$ by $N = (N_1, N_2, N_3)$. The tangent space to
the surface at $P_0$ is given by

$$N_1 x + N_2 y + N_3 z = 0 \tag{8}$$

Along $y = 0$, the unit vector $V_0$ in the tangent space is

$$V_0 = \left( \frac{-N_3}{\sqrt{N_1^2 + N_3^2}},\ 0,\ \frac{N_1}{\sqrt{N_1^2 + N_3^2}} \right) \tag{9}$$

To simplify the notation let

$$K_1 = \frac{-N_3}{\sqrt{N_1^2 + N_3^2}}, \qquad K_2 = \frac{N_1}{\sqrt{N_1^2 + N_3^2}} \tag{10}$$

Let $\theta$ be the angle of interest in the tangent space. The
goal is to find a unit vector $V_1 = (x_1, y_1, z_1)$ which
lies in the tangent space and makes an angle $\theta$ with
$V_0$. The angle $\theta$ provides the constraint

$$x_1 K_1 + z_1 K_2 = \cos\theta \tag{11}$$

From (8) we must require

$$x_1 N_1 + y_1 N_2 + z_1 N_3 = 0 \tag{12}$$

and since $V_1$ is a unit vector we have

$$x_1^2 + y_1^2 + z_1^2 = 1 \tag{13}$$

The equations (11), (12), (13) may be solved uniquely for
$x_1, y_1, z_1$ in the half space $x > 0$. Briefly, the
solution is

$$z_1 = \frac{-R_2 \pm \sqrt{R_2^2 - 4 R_1 R_3}}{2 R_1} \tag{14}$$

$$x_1 = C_4 - C_5 z_1 \tag{15}$$

$$y_1 = -\frac{N_1 x_1 + z_1 N_3}{N_2} \tag{16}$$

where

$$R_1 = C_5^2 C_1 + C_2 - C_3 C_5 \tag{17a}$$
$$R_2 = -2 C_1 C_4 C_5 + C_3 C_4 \tag{17b}$$
$$R_3 = C_1 C_4^2 - 1 \tag{17c}$$

$$C_1 = 1 + \frac{N_1^2}{N_2^2} \tag{18a}$$
$$C_2 = 1 + \frac{N_3^2}{N_2^2} \tag{18b}$$
$$C_3 = \frac{2 N_1 N_3}{N_2^2} \tag{18c}$$
$$C_4 = \frac{\cos\theta}{K_1} \tag{18d}$$
$$C_5 = \frac{K_2}{K_1} \tag{18e}$$
A special case occurs when $N_2 = 0$. For this case we use
(11) - (13) to arrive at

$$z_1 = \frac{-N_1 \cos\theta}{K_1 N_3 - K_2 N_1} \tag{19}$$

$$x_1 = \frac{\cos\theta - z_1 K_2}{K_1} \tag{20}$$

$$y_1 = \sqrt{1 - x_1^2 - z_1^2} \tag{21}$$

We use (14) - (21) to compute the components of $V_1$.

Assuming an orthographic projection and examining the geometry
of Figure 5, we see that the $x_1, y_1$ components of $V_1$
give us the projected vector we are seeking in the image. Let
$\theta'$ be the image angle corresponding to $V_1$. Then

$$\theta' = \tan^{-1}\left(\frac{y_1}{x_1}\right) \tag{22}$$

The next step is to use the $\alpha$ image to compute the
normal curvature of the surface in the direction $\theta$. The
normal curvature is computed by taking a straight line $b$ in
the $\alpha$ image which intersects $P_0$ and is in the
direction $\theta'$. Under orthographic projection, $b$ will
project to a line $b'$ in $T_{P_0}(M)$. The goal is to compute
the normal curvature of the curve $C \subset M$ where $C$ is
the orthogonal projection of $b'$ onto $M$. Since $C$ projects
to a line in $T_{P_0}(M)$, the magnitude of the geodesic
curvature $|\kappa_g|$ of $C$ is 0 at $P_0$. Thus local changes
in $\alpha$ along $C$ are due primarily to normal curvature
along $C$. We compute the normal curvature $\kappa_n$ in the
direction $\theta$ by

$$\kappa_n = \frac{d\alpha}{ds} \tag{23}$$

where $s$ is arc length in the direction $\theta$. In other
words, we are differentiating the $\alpha$ image along $b$ with
respect to arc length on the surface. From the local character
of specularities, we see that, to a very good approximation,
arc length on the surface is equal to length in $T_{P_0}(M)$.
Therefore, length in the image and arc length on the surface
are related by the scale factor $\sqrt{x_1^2 + y_1^2}$. Thus,
computing normal curvature on the surface has been reduced to
differentiation in the $\alpha$ image.

If we let $\theta$ vary in the range $0 \le \theta \le 2\pi$ we
can compute $\kappa_n$ in any number of directions at $P_0$.
The principal curvatures of $M$ at $P_0$ are defined to be the
maximum and minimum values of $\kappa_n$; the corresponding
directions are called the principal directions. Hence, using
this technique it is possible to describe $M$ locally to second
order in terms of principal curvatures and principal
directions. In the context of shape from shading, Bruss [?] and
Deift and Sylvester [?] examine the assumptions required to
generate higher order surface descriptions from an image.

4.2. Special Cases

In this subsection we examine specular reflection from special
classes of surfaces. In 4.2.1. and 4.2.2. we consider surfaces
which are locally singly curved and planar respectively. For
these surfaces, the Gaussian curvature is locally zero. In
4.2.3. we examine the case of corners and edges where surface
normal is discontinuous but where specularities are frequently
observed.

4.2.1. Singly Curved Surfaces

If one principal curvature of a surface is zero in a specular
region (i.e. the surface is locally singly curved), we will not
be able to infer immediately the local orientation as we did
for a doubly curved surface. To understand why, consider Figure
6. Figure 6 shows a viewer looking at a tilted cylinder. To
make the example concrete, assume that $L$ is such that
$H = V$. For this configuration there will be no point on the
surface for which $\alpha = 0$ (recall that $H$ is essentially
constant), yet we will still observe a specularity in the image
if at some point $\alpha$ is small enough to give a significant
value for $I_S$ in (6). Define $\psi$ to be the smallest value
of $\alpha$ for a given surface-source-viewer configuration.
Figure 7 shows a specularity generated by a cylinder which is
oriented so that $\psi$ is 20°. Note that a specular model
which assumes a smooth surface would not predict a specularity
for this case.




We observe that it is typically easy to see that a surface is
singly curved at a specularity. This is because we will observe
a line of maximum intensity (along the line of zero curvature)
instead of the point maximum we observe for the doubly curved
case.

Figure 8 is a plot of $I_S$ for a singly curved surface in a
direction perpendicular to the lines of zero curvature as we
change $\psi$. It is worth noting that both the magnitude and
shape of $I_S$ change as $\psi$ increases. Consequently, it is
possible to recover significant local shape information for
this class of surfaces.
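
A profile of this kind can be tabulated for a cylinder under the simplified Gaussian intensity model. The sketch below rests on several assumptions made here: the smallest facet angle is called `psi` (the symbol is not legible in the source), the intensity follows $K' e^{-(\alpha/m)^2}$, and the normals of a cylinder all lie in one plane, so the angle $\alpha$ at circumferential offset $\phi$ from the brightest line satisfies $\cos\alpha = \cos\psi \cos\phi$.

```python
import math


def cylinder_profile(psi, m, phis, K=1.0):
    """Specular intensity across a cylinder, perpendicular to its axis.

    psi  : smallest angle between any cylinder normal and H (assumed name).
    phis : angular offsets around the cylinder from the brightest line.
    Uses cos(alpha) = cos(psi) * cos(phi) and I = K * exp(-(alpha / m)**2).
    """
    profile = []
    for phi in phis:
        c = math.cos(psi) * math.cos(phi)
        alpha = math.acos(max(-1.0, min(1.0, c)))
        profile.append(K * math.exp(-((alpha / m) ** 2)))
    return profile


phis = [-0.2, -0.1, 0.0, 0.1, 0.2]
m = 0.3  # hypothetical roughness value
untilted = cylinder_profile(0.0, m, phis)
tilted = cylinder_profile(math.radians(20.0), m, phis)
```

In both cases the maximum lies on the line of zero curvature, and tilting the cylinder lowers that maximum to the value the model gives for $\alpha = \psi$.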


Figure 8. $I_S$ for different values of $\psi$


1  2. 2. Planes 

For a planar surface, N is constant. Hence, recalling our basic assumptions, I_S is constant across a plane. If the plane is oriented such that α is small enough, then a viewer will perceive an I_S reflection component. As with the singly curved surface, the magnitude of the perceived intensity will depend on α. If α is not sufficiently small, then I_S will be zero at all points on the plane. These observations provide us with two useful pieces of information:

1. Glossy surfaces which don’t generate specularities over a range of orientations are probably planar.

2. Surfaces which produce a specularity of constant intensity over a 2-D region in the image are locally planar.

Corners and Edges

Specularities are often observed at places of discontinuous surface normal on an object. Typical examples of these discontinuities are edges and corners on a polyhedron. For an ideal edge on a polyhedron, the surface normal is discontinuous across the edge (Figure 9).


Figure 9. An Ideal Edge

For an ideal edge joining two planes, we should not expect to observe a specularity unless either N1 or N2 is oriented in a direction which is sufficiently close to the perfect specular direction H. But for this case, as discussed in 4.2.2, we would expect to observe a spread out specular feature on one of the two planes joined by the edge. So why do we frequently see specular reflections along edges? On real polyhedra, surface normals are usually continuous across edges. Instead of the normal vector changing discontinuously, the normal usually changes smoothly from N1 to N2 by taking values which are linear combinations of N1 and N2. As we cross an edge, the surface normal moves rapidly through a large range of angles. If any of these normals is oriented in a direction sufficiently near the direction H, we will observe a specularity. Therefore, we often observe a specularity on an edge. Figure 10 shows an image of an edge specularity.


Figure  10.  Image  of  an  Edge  Specularity 


The situation is similar for trihedral vertices. As with edges, the normal vector is usually continuous at a corner. For trihedral vertices, the normal vector typically takes on values which are linear combinations of the three normals corresponding to the three planes defining the vertex.

From experiments with polyhedra, we have developed a useful model for the behavior of the surface normal across edges. Define r to be the edge sharpness parameter and assume the coordinate system of Figure 11. P1 and P2 denote two planes intersecting to form an edge.


Figure 11. The Geometry of a Rounded Edge

The y axis is aligned in the direction of N1 and the origin is a distance r from P1 such that the normal to the surface P1 begins to turn away from N1 when x becomes positive. The model for the normal as it turns from N1 to N2 is


\[ N = \frac{\sqrt{r^2 - x^2}}{r}\, N_1 + \frac{x}{r}\, N_2 \quad \text{for } 0 < x < r\,|N_1 \times N_2| \qquad (24) \]

In other words, the normal is assumed to turn through a curve of constant curvature 1/r. Here the parameter r is used to specify the sharpness of the edge. Small values of r indicate sharp edges, while larger values of r indicate more rounded edges. Figure 12 shows the profile of a specularity across a sharp 90° edge which is similar to the profile predicted by the continuous normal variation of (24).
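The argument that a rounded edge sweeps the normal through a range of directions, some of which may lie near H, can be illustrated numerically. The sketch below is not the model of (24) itself: it assumes the normal turns at a constant angular rate from N1 to N2 (spherical linear interpolation, which is a linear combination of the two normals), and the function names and the example vectors are hypothetical. It assumes N1 and N2 are neither parallel nor opposite.

```python
import numpy as np

def edge_normals(n1, n2, samples=181):
    """Unit normals turning at a constant angular rate from n1 to n2."""
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    theta = np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))  # total turning angle
    phi = np.linspace(0.0, theta, samples)
    # Spherical linear interpolation: each normal is a linear
    # combination of n1 and n2 and has unit length.
    w1 = np.sin(theta - phi) / np.sin(theta)
    w2 = np.sin(phi) / np.sin(theta)
    return w1[:, None] * n1 + w2[:, None] * n2

def min_angle_to_h(normals, h):
    """Smallest angle (degrees) between any of the normals and direction h."""
    h = h / np.linalg.norm(h)
    cosines = np.clip(normals @ h, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosines).min()))

# Hypothetical 90-degree edge with the specular direction H halfway between faces.
n1 = np.array([0.0, 1.0, 0.0])
n2 = np.array([1.0, 0.0, 0.0])
h = np.array([1.0, 1.0, 0.0])

normals = edge_normals(n1, n2)
face_min = min(min_angle_to_h(n1[None, :], h), min_angle_to_h(n2[None, :], h))
edge_min = min_angle_to_h(normals, h)
print(f"closest face normal: {face_min:.1f} deg, closest edge normal: {edge_min:.1f} deg")
```

In this configuration neither face comes closer than 45° to H, so an ideal edge would show no specularity, while the rounded edge contains a normal aligned with H.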


Figure 12. Specular Intensity Profile Across an Edge

4.3. Predicting A

In the previous analysis we have assumed that over most configurations of viewer, source, and surface the adjusted geometrical attenuation factor A of (1) will have a small constant value across the specularity. For large angles of incidence, however, the character of A changes remarkably (Figure 13). In particular, for large angles of incidence (glancing incidence) we see that

1.  A  becomes  large  relative  to  its  value  for  other  inci¬ 
dence  angles  (Figure  13). 

2. A causes a shift in the peak of the specular profile toward larger angles of incidence.

3.  A  causes  the  specular  profile  to  be  unsymmetric  as 
a  function  of  a. 

It is not surprising that when these effects are present in an image, they are rather easy to detect. For this reason, it is useful to make qualitative predictions about A in applications where large angles of incidence are possible.


Figure  13.  Plot  of  A  as  function  of  incidence  angle 

5. The Laboratory Setup

A laboratory arrangement has been set up to test the derived relationships (Figure 14). This section of the paper describes the laboratory setup. Section 6 examines factors which must be considered to successfully interpret real images. In Section 7, we describe an implemented system which has been used to infer local surface properties from specularities. Section 8 presents experimental results.




Figure 14. The Laboratory Setup


To insure accurate measurements, the experiments are conducted on a 4x6 foot optical table. High precision rotation and translation stages are used to position the objects being viewed. A halogen light source with a 5 mm wide filament is placed 20 feet from the object surface to approximate a point source. Monochromatic image data is obtained using a video camera and an image digitizer. A 210 mm lens is used with the video camera to obtain high resolution across the specularity. The resulting images are in the form of 256x256 arrays of pixels. Each pixel has eight bits of gray level resolution. A precise positioning device has been built to position the camera relative to the surface. Camera-object distances of at least 21 inches are enforced to insure that the assumed distant object condition is met. Using this setup, it is possible to obtain more than 10 pixels across a specular feature which is less than a centimeter wide on the surface. Metal cylinders and spheres of varying curvature are used to test the predicted relationships (Figure 15).


Figure 15. Some Experimental Specular Surfaces


6. Practical Considerations

This section examines factors which must be considered to enable a shape from specularity system to successfully interpret real images.

6.1. Gaussian Blur

Unfortunately, the formation of an image by an optical system introduces some amount of degradation. We can model this degradation as a convolution with a spatially invariant Gaussian point spread function [1]. The standard deviation of this blur is typically less than one pixel. For small specular features, taking into account the effects of this blur allows a more accurate determination of surface shape.

Our system uses a module called BLURINVERT to deblur the input specular image. For general 2-dimensional functions, inverting Gaussian blur is an unstable process. However, an explicit deblurring convolution kernel has been derived under certain assumptions in [15]. The 1-D continuous version of the kernel is given by

,=  (.v  D/2 

*C/.v  ( J* )  V  (  I  )*//, ,(,•)  (25) 

v  tl, 

where N is an odd integer denoting the order of the kernel and H_2k is the Hermite polynomial of degree 2k. Larger values of N give better deblurring filters (i.e. they recover exactly a larger space of blurred input functions), but are more costly to compute. The value of N that is chosen in applications depends on the intensity characteristics of the images that will be processed by the system. Using a 2-dimensional discrete version of (25), the BLURINVERT module allows our system to produce accurate shape descriptions from small specular features in images.
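As an illustration of how a Hermite-series kernel can invert Gaussian blur exactly on a restricted class of inputs, the following sketch builds a 1-D kernel from the truncated inverse heat-equation series. It is a derivation under stated assumptions, not a transcription of (25): the blur is taken to be a Gaussian of standard deviation `sigma`, the kernel is the Gaussian multiplied by the sum over k of (-1)^k / (2^k k!) H_2k(x / (sigma sqrt 2)), and the function name `deblur_kernel` and all parameters are hypothetical. The normalization used in [15] may differ.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite import hermval  # physicists' Hermite series

def deblur_kernel(sigma, order, half_width):
    """Sampled 1-D kernel that undoes Gaussian blur (std. dev. sigma)
    exactly on polynomials of degree less than `order` (order odd)."""
    x = np.arange(-half_width, half_width + 1, dtype=float)
    y = x / (sigma * np.sqrt(2.0))
    gauss = np.exp(-y**2) / (sigma * np.sqrt(2.0 * np.pi))
    series = np.zeros_like(x)
    for k in range((order - 1) // 2 + 1):
        coeffs = np.zeros(2 * k + 1)
        coeffs[2 * k] = 1.0  # selects the single term H_{2k}
        series += (-1.0) ** k / (2.0**k * factorial(k)) * hermval(y, coeffs)
    return gauss * series

# Blurring f(x) = x^2 with a Gaussian of std. dev. sigma gives x^2 + sigma^2.
sigma = 2.0
half_width = 20
x = np.arange(-60, 61)
blurred = x**2 + sigma**2
kernel = deblur_kernel(sigma, order=3, half_width=half_width)
restored = np.convolve(blurred, kernel, mode="valid")
expected = x[half_width:-half_width] ** 2
print(np.max(np.abs(restored - expected)))
```

An order of 3 suffices here because the test signal is quadratic; inputs containing higher-degree structure would need a larger order, which matches the trade-off between accuracy and cost described above.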

6.2. Quantization Effects

On a surface of high curvature, it is unlikely that we will measure the correct maximum specular intensity K' in (6). The problem is that for highly curved surfaces we are unable to shrink a pixel down to where the surface area it images is approximately planar. Even within the single pixel of maximum intensity, α is changing and cannot be considered constant. Hence the intensity value at the maximum pixel will be an average specular intensity over a small range of α and will not give the true K' of (6). This must be corrected for in applications. An artifact of this phenomenon is that measured K' seems to increase as surface curvature decreases. It follows that if we wish to measure K' for a material, we should use a surface of small curvature, ideally a plane.
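The averaging effect can be made concrete with a small numerical sketch. It assumes, purely for illustration, a Gaussian specular lobe of the form K' exp(-(α/m)²) and that α grows linearly with distance from the peak at a rate equal to the surface curvature; the function name, the lobe width `m`, and the pixel footprint are hypothetical and do not come from the specular model of (6).

```python
import numpy as np

def measured_peak(k_prime, m, curvature, pixel_size, samples=1001):
    """Mean of K' * exp(-(alpha/m)^2) over one pixel centered on the peak."""
    # Assumed: alpha sweeps +/- (curvature * pixel_size / 2) across the pixel.
    half_range = 0.5 * curvature * pixel_size
    alpha = np.linspace(-half_range, half_range, samples)
    return float(np.mean(k_prime * np.exp(-(alpha / m) ** 2)))

k_prime = 1.0      # true peak intensity (arbitrary units)
m = 0.1            # assumed lobe width in radians
pixel_size = 0.05  # assumed footprint of one pixel on the surface

for curvature in (0.0, 0.5, 2.0, 4.0):
    peak = measured_peak(k_prime, m, curvature, pixel_size)
    print(f"curvature {curvature:3.1f}: measured peak {peak:.4f}")
```

Under these assumptions the measured peak equals K' only for the plane and falls as curvature rises, which is the trend described in the text.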

Since  specularities  are  usually  the  brightest  features 
in  images,  specular  intensities  are  often  too  large  to  be 
represented  in  the  number  of  bits  per  pixel  allowed  by 
the  digitizing  hardware  or  within  the  dynamic  range  of 
the camera. If this is the case, the specularity is truncated. Figure 16 shows I_S for a truncated specularity.
The  obvious  way  to  deal  with  this  situation  is  to  avoid 
it.  One  avoidance  technique  is  to  take  multiple  images 
in  which  differing  amounts  of  light  are  allowed  to  reach 
the  camera.  This  can  be  achieved  either  by  adjusting 
the  lens  aperture  or  by  using  filters.  Another  possible 
solution  is  to  control  the  illumination  to  eliminate  the 
possibility  of  truncation. 


Figure 16. A Truncated Specularity


If inferences must be made from a single image, then it is arguably better to allow truncation to occur. In the case where input images have eight bits per pixel, intensities will range from 0 to 255. In many applications it is possible to weaken the incident illumination so that no truncation occurs. In doing this, however, we cause pixels on the I_S curve which previously had significant specular intensities (on the truncated specular feature) to have negligible specular intensities. The net effect of eliminating truncation is to decrease the width of the specular feature and make measurements more susceptible to small errors.

7. A Shape from Specularity System

A system has been implemented which computes local surface properties from images of specular surfaces [12]. The system currently stands alone, but will be used in the more general context of the SUCCESSOR vision system. The shape from specularity system is primarily designed to perform the computations described in Section 4. This section describes the implementation of the system.

7.1. Overview of System Structure

At a high level of abstraction, the problem is best solved in two steps. The first step is to deblur the input specular intensity image. The second step is to compute local surface properties from the image resulting from step 1.

The system designed to solve the problem preserves this two step structure. Figure 17 is a diagram of the modules in our system with arcs indicating module interactions. From this diagram we see that there is a clean separation between the deblurring task and the task of computing local surface properties. First the main program invokes a function called BLURINVERT to deblur the input image. After the deblurring task is completed, the function CURVATURES is called to compute local surface properties. The next two subsections give overviews of the BLURINVERT and CURVATURES functions.

Figure 17. System Structure
7.2.  Overview  of  BLURINVERT 

The  BLURINVERT  function  is  used  to  deblur  the 
Gaussian  blurred  input  image.  This  is  accomplished  in 
two  stages.  First,  the  deblurring  convolution  kernel  is 
generated  by  DEBLUR.  Then  the  CONVOLVE  function 
is  called  to  perform  the  convolution  of  the  blurred  in¬ 
put  image  with  the  constructed  deblurring  kernel.  Two 
functions are used by DEBLUR to manipulate the Hermite polynomials required to generate the deblurring filter. The function HERMITE uses a dynamic programming scheme to compute coefficients of all Hermite
polynomials  up  to  some  specified  degree.  The  function 
HEVAL is used to evaluate Hermite polynomials at fixed
values  of  the  polynomial  parameter. 
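One way such a coefficient table could be built is with the standard three-term recurrence H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x), reusing the two previous rows at each step. The sketch below is only an illustration of that idea; the function names are hypothetical, the physicists' convention for Hermite polynomials is assumed, and nothing here is taken from the implementation described in [12].

```python
def hermite_coefficients(max_degree):
    """Return table where table[n][i] is the coefficient of x**i in H_n."""
    table = [[1]]                      # H_0 = 1
    if max_degree >= 1:
        table.append([0, 2])           # H_1 = 2x
    for n in range(1, max_degree):
        prev, cur = table[n - 1], table[n]
        nxt = [0] * (n + 2)
        for i, c in enumerate(cur):
            nxt[i + 1] += 2 * c        # contribution of 2x * H_n
        for i, c in enumerate(prev):
            nxt[i] -= 2 * n * c        # contribution of -2n * H_{n-1}
        table.append(nxt)
    return table

def evaluate(coeffs, x):
    """Evaluate a polynomial given by ascending coefficients (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

table = hermite_coefficients(4)
print(table[3])               # coefficients of H_3 in ascending powers of x
print(evaluate(table[3], 2))  # H_3(2)
```

Because each row depends only on the two rows before it, all polynomials up to the requested degree are obtained in a single pass.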

7.3.  Overview  of  CURVATURES 

Given a deblurred specular intensity image, the CURVATURES function computes the principal curvatures and directions of the surface at specular points. CURVATURES first uses the function LOGIMAGE to transform the intensity image into the α angle image of Section 4. The CURVATURES function then systematically computes 1-D curvature at different directions in the tangent space to the surface. The function PROJECT is used to compute the metric transform between the tangent space to the surface and the image. This is necessary to insure that the system samples the angle image at equidistant angles on the surface. The function SAMPLE is used to sample the α angle image in a specified direction. Finally, the function UNRAll is used to compute the least squares curvature given the data generated by PROJECT and SAMPLE.
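The chain from intensity to angle to curvature can be sketched for a single sampling direction. This is a simplified illustration and not the system's code: it assumes a Gaussian specular lobe I = K' exp(-(α/m)²) so that α can be recovered from a logarithm, assumes α = κ|s| along a direction through the specular peak (s being distance on the surface), and fits κ by least squares through the origin. All names and parameter values are hypothetical.

```python
import numpy as np

def alpha_from_intensity(intensity, k_prime, m):
    """Invert the assumed lobe I = K' exp(-(alpha/m)^2) for alpha >= 0."""
    ratio = np.clip(intensity / k_prime, 1e-12, 1.0)
    return m * np.sqrt(-np.log(ratio))

def least_squares_curvature(s, alpha):
    """Least squares slope of alpha = kappa * |s| (line through the origin)."""
    d = np.abs(s)
    return float(np.dot(d, alpha) / np.dot(d, d))

# Synthetic profile across a surface with curvature 0.4 in this direction.
k_prime, m, kappa_true = 200.0, 0.15, 0.4
s = np.linspace(-0.5, 0.5, 21)          # surface distances from the peak
intensity = k_prime * np.exp(-((kappa_true * np.abs(s)) / m) ** 2)

alpha = alpha_from_intensity(intensity, k_prime, m)
kappa_est = least_squares_curvature(s, alpha)
print(f"estimated curvature: {kappa_est:.4f}")
```

Repeating such a fit over several directions in the tangent plane would give a set of 1-D curvatures from which maximum and minimum values and their directions could be selected.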

8.  Experimental  Results 

The system described in Section 7 has consistently generated accurate surface descriptions from images of specular surfaces. In this section we give examples of our system’s performance on real images of metal objects illuminated by a point source. Figures 18(a), 19(a), and 20(a) are images of circular cylinders of varying radii. Each cylinder is oriented such that its axis is perpendicular to the axis of the imaging device. Figure 21(a) is the image of a sphere. The actual statistics of the surfaces are given in Table 1. The dotted lines in the images indicate the direction of maximum curvature as determined by the system. Figures 18(b), 19(b), 20(b), and 21(b) are plots of intensity along the dotted lines in 18(a), 19(a), 20(a), and 21(a). Note that in Figure 21 the specularity is truncated, but we are still able to compute accurate surface statistics. Table 2 gives the second order surface statistics computed by the system.
Error  represents  the  percent  of  error  in  the  computation 
of  the  largest  curvature  of  the  surface.  The  small  errors 
can  be  attributed  to  quantization  effects,  noise  intro¬ 
duced  during  the  measurement  process,  and  the  various 
simplifications  made  to  the  specular  model. 


Figure 18(a)

Figure 18(b)

Figure 19(a)

Figure 21(b)


Object      κ1      θ1      κ2      θ2
cylinder1   0.286   0.0     0.0     1.571
cylinder2   0.400   0.0     0.0     1.571
cylinder3   1.333   0.0     0.0     1.571
sphere1     0.500   —       0.500   —

Table 1. Actual Surface Statistics


Object      κ1      θ1      κ2      θ2      Error
cylinder1   0.297   0.0     0.001   1.571   3.9%
cylinder2   0.397   0.0     0.001   1.571   0.9%
cylinder3   1.356   0.0     0.002   1.571   1.7%
sphere1     0.514   0.0     0.534   1.571   2.8%

Table 2. Computed Surface Statistics


9. Summary and Implications

Understanding specular reflection is important for any vision system which must interpret a world containing glossy objects. Using a model developed by optics researchers, we have shown that we can accurately predict the appearance of specular features in an image. In addition we have shown how to compute the local orientation, principal curvatures, and principal directions of a specular surface by examining image intensities on a specularity. These statistics give a complete local characterization of the surface up to second order. Unlike previous work, our derivations have included the effects of surface roughness and microstructure on the appearance of specular features.

A system has been implemented which computes local surface properties from images of specular objects. A laboratory setup has been arranged which allows us to capture images to test our system. The system has consistently produced accurate surface descriptions despite the fact that the high intensity and small spatial extent of specularities make measurements difficult. Significant aspects of the implementation are discussed in Section 7. Examples of experimental results are given in Section 8.

The ability to predict intensity-based features such as specularities opens up interesting possibilities for model-based vision. Previous model-based vision systems have restricted their predictions to the structure of edges which will be observed for a given model. An ability to predict intensity-based features will significantly enhance the top-down capabilities of a model-based vision system. Clearly it is advantageous to be able to make stronger predictions about an image by using additional information about the imaging process. Another important advantage of predicting intensity-based features is that this prediction can provide strong guidance to low level intensity-based visual processes. By making predictions about the appearance of intensity patches in an image we can hope to further unify the goals of the low level and high level mechanisms of a model-based vision system.

More important than being able to predict the appearance of specularities from surface models is our system’s ability to invert the process. We have shown how to infer local second order surface shape from specular images. This capability provides a vision system with strong generic information about a surface in a scene using strictly bottom-up processing. Inferring shape information from specularity is particularly important when viewing metal surfaces because other shape cues such as shading and texture are often not present. For other kinds of surfaces, shape information from specularity can be combined with shape information obtained using other cues to improve the 3-D surface descriptions generated by a vision system.

Acknowledgements 

This work has been supported by an NSF graduate fellowship, AFOSR contract F49620-82-C-0092, and ARPA contract N00039-84-C-0211. The authors would like to thank Professor Hesselink for generously providing laboratory space and equipment and Rami Rise for his work in designing the mechanical parts for the experiments. We would also like to thank Vishvjit Nalwa for providing useful comments on a draft of this paper.

References 

[1]  Andrews,  H.  and  Hunt,  B.,  "Digital  Image  Restora¬ 
tion,"  Prentice-Hall  Inc.,  Englewood  Cliffs,  1977. 


[2] Blake, A., “Specular Stereo,” Proceedings of IJCAI-9 (Los Angeles: August 1985), 973-978.

[3] Blinn, J., “Models of Light Reflection for Computer Synthesized Pictures,” Computer Graphics, 11(2) (1977), 192-198.

[4]  Brooks,  R.,  “Symbolic  Reasoning  Among  3-D  Models 
and  2-D  Images,”  Artificial  Intelligence,  17  (1981),  285- 
348. 

[5]  Bruss,  A.,  “The  Image  Irradiance  Equation:  Its  So¬ 
lution  and  Application,”  MIT  AI  Memo  623,  June  1981. 

[6] Chelberg, D. and Lim, H. and Cowan, C., “ACRONYM Model-based Vision in the Intelligent Task Automation Project,” Proceedings of Image Understanding Workshop (1984).

[7]  Coleman,  E.N.  Jr.  and  Jain,  R.,  “Obtaining  3-D 
Shape  of  Textured  and  Specular  Surfaces  Using  Four- 
Source  Photometry,”  Computer  Graphics  and  Image 
Processing,  18  (1982),  309-328. 

[8]  Cook,  R.  and  Torrance,  K.,  “A  Reflectance  Model  for 
Computer  Graphics”,  Computer  Graphics,  15(3)  (1981), 
307-316. 

[9] Deift, P., and Sylvester, J., “Some Remarks on the Shape-from-Shading Problem in Computer Vision,” Journal of Mathematical Analysis and Applications, Vol. 84, No. 1, pp. 235-248, November, 1981.

[10] Foley, J. and Van Dam, A., Fundamentals of Interactive Computer Graphics, Addison-Wesley, 1982.

[11] Grimson, W.E.L., “Binocular Shading and Visual Surface Reconstruction,” MIT AI Memo 697 (1982).

[12]  Healey,  G.,  “Solving  the  Inverse  Specular  Problem”, 
Stanford  Computer  Science  Department  Programming 
Project,  March  1986. 

[13]  Horn,  B.,  “Obtaining  Shape  from  Shading  Infor¬ 
mation”,  in  “The  Psychology  of  Computer  Vision”  (P. 
Winston,  Ed.),  McGraw-Hill,  New  York,  1975. 

[14]  Horn,  B.,  Robot  Vision,  MIT  Press,  1986. 

[15]  Hummel,  R.  and  Kimia,  B.  and  Zucker,  S.,  “Gaus¬ 
sian  Blur  and  the  Heat  Equation:  Forward  and  Inverse 
Solutions,”  Proceedings  of  CVPR  (San  Francisco:  June, 
1985). 

[16] Ikeuchi, K., “Determining Surface Orientations of Specular Surfaces by Using the Photometric Stereo Method”, IEEE PAMI 3(6) (1981), 661-669.

[17]  Kreyszig,  E.,  Introduction  to  Differential  Geometry 
and  Riemannian  Geometry ,  University  of  Toronto  Press, 
1968. 

[18] Phong, B., “Illumination for Computer Generated Pictures,” Communications of the ACM 18 (1975), 311-317.

[19] Shafer, S., “Optical Phenomena in Computer Vision,” Univ. of Rochester TR 135 (1984).

[20] Spivak, M., Differential Geometry - Volume II, Publish or Perish, Inc., 1979.

[21] Sparrow, E. and Cess, R., Radiation Heat Transfer, McGraw-Hill, New York, 1978.

[22] Takai, K. and Kimura, F. and Sata, T., “A Fast Visual Recognition System of Mechanical Parts by Use of Three Dimensional Model,” source unknown. (First author is with CANON INC. in Tokyo.)

[23] Torrance, K. and Sparrow, E., “Theory for Off-Specular Reflection from Roughened Surfaces,” Journal of the Optical Society of America, 57 (1967), 1105-1114.

[24] Woodham, R., “Photometric Stereo: A Reflectance Map Technique for Determining Surface Orientation from Image Intensity,” Proc. SPIE, vol. 155 (1978).


FINDING  OBJECT  BOUNDARIES  USING 
GUIDED  GRADIENT  ASCENT 


Yvan  Leclerc  and  Pascal  Fua 
Artificial  Intelligence  Center,  SRI  International 
333  Ravenswood  Avenue,  Menlo  Park,  California  94025* 


Abstract 

We  present  a  technique  for  finding  object  boundaries  that  com¬ 
bines  both  low-  and  high-level  information.  We  integrate  infor¬ 
mation  along  a  candidate  boundary  and  incrementally  deform 
it,  using  a  guided  gradient  ascent  procedure,  until  a  satisfactory 
curve  is  found.  The  strength  of  this  technique  lies  in  our  ability 
to  guide  the  search  by  applying  forces  to  the  boundary,  guiding 
its  change  in  shape  and  location.  This  technique  overcomes  in¬ 
herent  image  ambiguities,  thereby  finding  boundaries  that  could 
not  otherwise  be  found. 

1  Introduction 

In  real-world  images,  physical  event  discontinuities  cannot  be 
detected  solely  by  their  local  statistical  signature  in  the  image 
because  of  the  presence  of  noise  and  various  photometric  anoma¬ 
lies.  All  methods  for  finding  discontinuities  that  use  purely  lo¬ 
cal  statistical  criteria  are  therefore  bound  to  make  mistakes  and 
miss  some  of  the  weaker  parts  of  edges. 

To  supplement  the  weak  and  noisy  local  information,  we  must 
integrate  information  from  the  many  points  that  lie  on  an  edge 
without  introducing  irrelevant  and  often  misleading  information 
from  other  points.  We  propose  to  do  this  by  integrating  infor¬ 
mation  along  a  candidate  curve  and  incrementally  deforming  it, 
using  a  guided  gradient  ascent  procedure,  until  a  satisfactory 
curve  is  found.  The  strength  of  our  procedure  lies  in  the  ability 
to  guide  the  search,  perhaps  on  the  basis  of  semantic  informa¬ 
tion,  by  applying  forces  to  the  curve  to  guide  its  change  in  both 
shape  and  location. 

In  the  next  section,  we  describe  our  model  of  an  ideal  step- 
edge.  We  then  describe  the  guided  gradient  ascent  procedure, 
using  the  search  for  a  straight  edge  as  an  example.  Finally,  we 
present  applications  of  the  procedure  to  automated  cartography 
and  object  location. 

2  Our  Model  of  An  Ideal  Step-Edge 

We  define  an  ideal  step-edge  to  be  a  simple  one-dimensional 
curve  C  for  which  the  following  properties  hold  everywhere  along 
the curve [Haralick, 1984]:

1. The magnitude of the directional first-derivative of image intensity normal to the curve is a local maximum in that direction; i.e., the directional second-derivative of image intensity normal to the curve is zero.


2.  The  magnitude  of  the  directional  first-derivative  of  image 
intensity  tangential  to  the  curve  is  zero. 

Ideally, we wish to define a functional F(C; I) such that it has local maxima with respect to infinitesimal deformations of C when and only when these properties hold. Given such a functional, we can then formulate the problem of finding an ideal edge as one of maximizing F using gradient ascent methods.

We  can  define  such  a  functional,  but  it  is  expensive  to  compute 
because  it  involves  calculating  the  directional  first-derivative  of 
image  intensity  at  every  point  along  the  curve  as  a  function  of 
its  tangent.  Instead,  we  define  a  functional  that  is  much  less 
expensive  to  compute,  but  that  depends  on  conditions  that  are 
only  necessary  for  a  curve  to  be  an  ideal  step-edge.  Thus,  once 
a  candidate  curve  has  been  found  by  maximizing  this  simpler 
functional,  we  must  test  it  using  the  more  expensive  and  reliable 
criteria  specified  above. 

Two  necessary,  but  not  sufficient,  conditions  for  a  curve  C  to 
be  an  ideal  step-edge  are: 

1. the average magnitude of image intensity gradient (|∇I(x, y)|) along C must be a local maximum with respect to infinitesimal deformations of C, and

2.  the  average  variance  of  the  magnitude  of  image  intensity 
gradient  along  C  must  be  a  local  minimum  with  respect  to 
infinitesimal  deformations  of  C; 

with  the  corresponding  functionals: 

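\[ F_m(C; I) = \int_C |\nabla I(x, y)| \, ds \,/\, |C| \]

\[ F_v(C; I) = \int_C \left( |\nabla I(x, y)| - \mu \right)^2 ds \,/\, |C|, \]

\[ \mu = F_m(C; I), \qquad |C| = \text{length of } C. \]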
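For a straight line-segment, the two functionals can be approximated by sampling a precomputed gradient-magnitude image along the segment. The following sketch is an illustration only: the Sobel operator is written out by hand, nearest-pixel sampling stands in for the integrals, and the function names, sample count, and synthetic test image are all hypothetical.

```python
import numpy as np

def sobel_magnitude(image):
    """Gradient magnitude from 3x3 Sobel differences (border pixels left at 0)."""
    img = image.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[1:-1, 1:-1] = ((img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:])
                      - (img[:-2, :-2] + 2 * img[1:-1, :-2] + img[2:, :-2]))
    gy[1:-1, 1:-1] = ((img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:])
                      - (img[:-2, :-2] + 2 * img[:-2, 1:-1] + img[:-2, 2:]))
    return np.hypot(gx, gy)

def segment_functionals(grad_mag, p0, p1, samples=50):
    """Approximate F_m (mean) and F_v (variance) of |grad I| along p0 -> p1.

    Points are (x, y) pixel coordinates; sampling is uniform in arc length.
    """
    t = np.linspace(0.0, 1.0, samples)
    xs = np.rint(p0[0] + t * (p1[0] - p0[0])).astype(int)
    ys = np.rint(p0[1] + t * (p1[1] - p0[1])).astype(int)
    values = grad_mag[ys, xs]
    f_m = float(values.mean())
    f_v = float(((values - f_m) ** 2).mean())
    return f_m, f_v

# Synthetic image with a vertical step edge between columns 9 and 10.
image = np.zeros((20, 20))
image[:, 10:] = 100.0
grad = sobel_magnitude(image)

on_edge = segment_functionals(grad, (10, 2), (10, 17))
uniform = segment_functionals(grad, (3, 2), (3, 17))
crossing = segment_functionals(grad, (5, 2), (15, 17))
print(on_edge, uniform, crossing)
```

The segment in the uniform region has zero variance but also zero mean gradient, and the crossing segment has a nonzero mean but a large variance, which is the behavior that motivates combining the two functionals rather than using either alone.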

In  principle,  one  should  be  able  to  use  either  one  or  the  other 
functional  alone,  since  they  correspond  to  necessary,  but  not  suf¬ 
ficient,  conditions.  In  practice,  however,  using  either  one  alone 
produces  too  many  local  extrema  that  do  not  satisfy  the  full 
conditions. An intuitive explanation for this is that maximizing F_m(C; I) alone can produce curves that span several disjoint edges whose average magnitude might be quite high, whereas minimizing F_v(C; I) alone can produce arbitrary curves over entirely uniform areas of the image. For these reasons, we have chosen to maximize the combination

m,]‘\ MS' 




which  almost  always  produces  satisfactory  curves  when  started 
sufficiently  near  an  edge.  Choosing  a  large  a  tends  to  force 
intermediate  curves  to  be  approximately  parallel  to  the  final 
curve,  providing  stability  in  the  search.  However,  a  should  be 
chosen  to  be  small  enough  that  somewhat  nonideal  edges,  which 
are  prevalent  in  real  images,  can  still  be  found.  In  practice, 
a  =  0.2  produces  satisfactory  results  in  most  situations. 

The great advantage of this functional over the ideal one is that it can be computed very cheaply because |∇I(x, y)| need be computed only once for every image point, independently of the curve. Thus, we can approximate the integrals by summing over a precomputed gradient-magnitude image such as that produced by a Sobel operator. Furthermore, in the final stages of the procedure described in the next section, we can interpolate the results of the Sobel operator to produce subpixel estimates of the gradient-magnitude, and hence subpixel estimates of the edge.

3 The Guided Gradient Ascent Procedure

We are interested in finding curves that maximize F(C; I) given an initial estimate of the curve’s location and shape [Witkin et al., 1986]. For a curve defined by a number of geometric parameters, we can do this by using gradient ascent given initial estimates of these parameters.

The unguided gradient ascent algorithm iteratively takes steps of a given size in the direction of the gradient. As the iterations proceed, the step-size is reduced whenever F decreases and when certain quality criteria, such as F_m being greater than some minimal value, are satisfied. The latter criterion is important to avoid stopping at weak local maxima, which occur relatively frequently because of noise. The procedure is stopped when the step-size becomes smaller than some minimal value, typically corresponding to some fraction of a pixel for position and a few degrees for orientation.
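The step-size schedule can be sketched generically. The code below is a simplified illustration, not the procedure as implemented: it takes fixed-length steps along a numerically estimated gradient, halves the step whenever the objective fails to increase, and stops once the step falls below a minimum. It omits the quality criterion on F_m, and the function name, halving factor, and toy objective are assumptions.

```python
import numpy as np

def gradient_ascent(objective, params, step=1.0, min_step=1e-3,
                    eps=1e-4, max_iter=1000):
    """Maximize objective(params) with fixed-length steps and step halving."""
    params = np.asarray(params, dtype=float)
    value = objective(params)
    for _ in range(max_iter):
        if step < min_step:
            break
        # Central-difference estimate of the gradient.
        grad = np.array([
            (objective(params + eps * e) - objective(params - eps * e)) / (2 * eps)
            for e in np.eye(len(params))
        ])
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        candidate = params + step * grad / norm
        new_value = objective(candidate)
        if new_value > value:
            params, value = candidate, new_value   # accept the step
        else:
            step *= 0.5                            # objective fell: shrink the step
    return params, value

# Toy objective with a single maximum at (3, -1), standing in for F(C; I).
def toy(p):
    return -((p[0] - 3.0) ** 2) - (p[1] + 1.0) ** 2

best, best_value = gradient_ascent(toy, [0.0, 0.0])
print(best, best_value)
```

For a line-segment the parameter vector would hold position and orientation, and separate minimum step sizes for each would correspond to the fraction-of-a-pixel and few-degree stopping thresholds mentioned above.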

For  example,  consider  the  case  of  a  straight  line-segment  of 
fixed  length.  Figure  1(a)  shows  the  image  that  we  will  be  using 
for  all  of  our  examples,  and  Figure  1(b)  shows  the  corresponding 
gradient-magnitude  image. 

Figure  2(a)  shows  such  a  line-segment  placed  near  a  straight 
edge  in  an  image.  Figure  2(b)  shows  an  intermediate  step  as 
the  gradient  ascent  procedure  iterates,  with  the  final  result  in 
Figure  2(c). 

Of  course,  this  procedure  crucially  depends  on  being  relatively 
near  the  correct  final  result.  For  example,  if  we  were  to  place 
the  initial  line-segment  in  a  uniform  area,  it  would  oscillate  and 
might  never  stop. 

The  strength  of  our  approach,  however,  lies  in  our  ability  to 
guide  the  procedure  when  some  a  priori  semantic  or  geometrical 
information  is  known.  This  is  done  by  applying  forces  to  the 
curve  by  multiplying  the  base  functional  by  forcing  functionals 
that  depend  on  the  parameters  of  the  curve,  but  not  on  the 
image. 

To continue our previous example using forcing functionals,
we want to find an edge parallel to the one in Figure 2(c). We can
guide the procedure by applying a force to each of the end-points
in the direction perpendicular to the original line (multiplying the
base functional by the length of the perpendicular from the end-point
to the original line). By starting with a line-segment
close to the one in Figure 2(c), and applying this force,
the system found the parallel edge illustrated in Figure 3. Note
that the gradient-magnitude of this edge is extremely weak, and
would therefore have been missed entirely by local techniques.

4  Applications  of  the  Approach 

We  have  implemented  several  more  sophisticated  applications  of 
the  guided  gradient  ascent  approach,  of  which  only  two  can  be 
briefly  discussed  here. 

The  first  application  is  to  automated  cartography.  In  an  aerial 
image,  the  height  of  a  building  can  be  estimated  by  using  the 
known  sun  direction,  the  position  of  one  of  its  corners,  and  its 
corresponding  shadow  corner.  In  order  to  build  an  interactive 
cartographic  system  that  will  perform  this  task  automatically, 
we  want  to  allow  the  user  to  point  near  a  corner  and  have  the 
system  find  its  exact  location  and  orientation. 

The procedure we have implemented starts by having the user
point inside a building or shadow near a corner. A rough estimate
of the directions of the edges is computed by finding the
two main peaks in the histogram of Sobel directions in a window
around this point. These two directions are used to split
the window into four quadrants, corresponding to the four possible
orientations of the corner. If the corner is within the window,
three of the quadrants overlap its edges and only one is
within the object or the shadow. A simple area-based measure
determines which quadrant is most uniform, and hence entirely
within the object or shadow. Figure 4(a) shows the initial guess
computed for two user-supplied points.

The initial guess is fed to the optimization algorithm, with
forces applied to the candidate corner such that its edges are
pushed outward. The edges are coupled by a force proportional
to the angle of the corner. This prevents one edge from stopping
in a local minimum when the other has not yet stabilized.
Figure 4(b) shows the final result, in which the corners have been
accurately found although the initial points were quite distant
from the actual corners.

The second application, described in detail in these proceedings
[Fua and Hanson, 1987], is to integrate this approach in
a fully automated system for the detection and location of cultural
objects in aerial images. Briefly, the system hypothesizes
the shape of objects by combining an Ohlander-style segmentation
with geometric models and semantic information about the
expected class of objects. The guided gradient ascent is then
used to improve and verify these predictions. The technique has
allowed the system to discover boundaries that were entirely
missed by both the segmentation and local edge operators.

5  Conclusions 

We have presented a technique for finding object boundaries
given rough initial estimates of their shape and position. The
power of this technique lies in its ability to combine low- and
high-level information. In many scenes, the ambiguities of the
data are such that shapes cannot be extracted reliably by the
sole use of low-level cues. Our technique allows us to supplement
this weak information with higher-level knowledge, such
as the expected shape of the boundaries, which is used to determine
the shape of the operator and the guiding forces. This
additional information allows the system to overcome inherent
image ambiguities and find shapes that could not otherwise be
found.




References 

P. Fua and A. Hanson, “Using Generic Geometric Models for
Intelligent Shape Extraction,” these proceedings, 1987.

R.M. Haralick, “Digital Step Edges from Zero Crossings of Second
Directional Derivatives,” IEEE Trans. PAMI, Vol. 1,
pp. 58-68, 1984.

A. Witkin, M. Kass, D. Terzopoulos, and A. Barr, “Modelling
with Dynamic Constraints: Linking Graphics to Perception,”
Proc. of the Rank Prize Funds International Symposium
on Images and Understanding, Royal Society, London,
England, September, 1986.


Figure 1: (a) The original image. (b) The gradient image.

Figure 2: (a) The initial guess for the edge. (b) An intermediate step. (c) The final location of the edge.

Figure 3: A weak parallel edge.

Figure 4: (a) The initial guess for the corners. (b) The final location.


The  work  reported  here  was  partially  supported  by  the  Defense 
Advanced  Research  Projects  Agency  under  contracts  DACA76-85-C-0004 
and  MDA903-86-C-0084. 


DETECTING  BLOBS  AS  TEXTONS  IN  NATURAL  IMAGES 


Harry  Voorhees  and  Tomaso  Poggio 


Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 


Abstract.  Texture  provides  one  cue  for  identifying  the 
physical  cause  of  an  intensity  edge,  such  as  occlusion, 
shadow,  surface  orientation  or  reflectance  change.  Marr, 
Julesz,  and  others  have  proposed  that  texture  is  represented 
by small lines or blobs, called “textons” by Julesz [1983], together
with their attributes, such as orientation, elongation,
and  intensity.  Psychophysical  studies  suggest  that  texture 
boundaries  are  perceived  where  distributions  of  attributes 
over  neighborhoods  of  textons  differ  significantly. 

However, these studies, which deal with synthetic images,
neglect to consider an important question: How can
these textons be extracted from images of natural scenes?
This paper answers this question by proposing
an algorithm for detecting blobs in natural images. As
part of the blob detection algorithm, methods for estimating
image noise are presented, which are applicable to edge
detection as well.


1  Introduction 

A crucial step in the visual process is the identification of
physical discontinuities in the scene: sharp changes in surface
depth and orientation, reflectance, and illumination.
All of these phenomena can result in intensity edges, so
other processes besides edge detection are needed to identify
these events. Texture, along with color and motion,
provides a cue. As discussed by Riley [1981], drastic texture
changes can only be explained by occlusion, while
projection-type changes can be due to a surface orientation
discontinuity as well. It is with this goal in mind, the
interpretation of physical discontinuities, that we investigate
the visual processing of texture information.

In his theory of the primal sketch, Marr [1976] proposed
that texture is represented in the raw primal sketch, a
symbolic description of localized intensity changes in the
image, including edges, compact blobs, and linear blobs
called bars. Attributes of these tokens (their size, orientation,
contrast, and termination points) are computed, along
with their position. Texture boundary detection and the
extraction of form proceed by comparing first order statistics
of distributions of these attributes (size, orientation,
contrast, and density) over nearby neighborhoods.


Since the primal sketch theory was proposed, a number
of psychophysicists have studied human texture perception.
Julesz (who had previously held the opposite view) demonstrated
that first order local statistics of blobs better account
for perceived texture differences than second order
global statistics of pixels [Julesz, 1983]. He concluded that
blobs called “textons”, together with their geometric and
intensity attributes, termination points, and crossings,
form the primitives on which texture perception is based.
Many psychophysical studies of texture perception have assumed
the existence of texton detectors, employing synthetic
images of symbols such as line segments.

However, these psychophysical studies, which are based
only on synthetic images, neglect to consider an important
issue: How can such textons be reliably extracted from images
of natural scenes? Here we attempt to answer this
question by describing an algorithm we have implemented
which detects small blobs in natural images.

2  Noise  estimation 

When dealing with images of natural scenes, it is necessary
to estimate the amount of noise in an image. In edge or
blob detection, the noise estimate determines a threshold
above which an intensity change is considered “significant.”
In this section, we show how noise can be estimated from a
histogram of the image convolved with a gaussian or laplacian
of gaussian filter, depending on which is used for edge
or blob detection.

2.1  Gaussian  filtering 

If one assumes that the smoothed image $f(x, y) = I * G$
consists of white gaussian noise,

$$\Pr_f(z) = G(\sigma; z) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-z^2 / 2\sigma^2} \qquad (1)$$

then, using the probability law for a function of random
variables, the partial derivatives $f_x$ and $f_y$ have p.d.f.'s

$$\Pr_{f_x}(z) = \Pr_{f_y}(z) = \int_{-\infty}^{\infty} \Pr_f(s)\, \Pr_f(z - s)\, ds = G(\varsigma; z), \qquad (2)$$

where $\varsigma = \sqrt{2}\,\sigma$. Again using the probability law, the
p.d.f. of the magnitude of the gradient ($\|\nabla f\| = \sqrt{f_x^2 + f_y^2}$)
is

$$\Pr_{\|\nabla f\|}(m) = m \int_0^{2\pi} \Pr_{f_x}(m \cos\theta)\, \Pr_{f_y}(m \sin\theta)\, d\theta \propto m\, G(\varsigma; m), \quad \text{for } m > 0. \qquad (3)$$

Figure 1: The Rayleigh distribution
(a) Rayleigh distribution and (b) its first derivative.

The distribution of the gradient magnitude is a Rayleigh
distribution, graphed in Figure 1, with a maximum value
at $m = \varsigma$. The addition of real edges, whose gradients are
stronger than those of the noise distribution, affects mainly
the tail of the distribution, and does not significantly affect
the location of the peak. Bracho and Sanderson [1985], who
applied this idea to region growing, noted that the assumption
holds as long as a large portion of the image consists
of roughly uniform intensity which satisfies the white noise
model. Often the background of the image is such a region.
In this case, the noise can be estimated by constructing a
histogram $h(m)$ of $\|\nabla f\|$, and simply measuring the location
of the peak as $\varsigma$. Figure 2 shows an example for which
this simple method works well.
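
The simple peak-location estimate can be illustrated with synthetic data. The sketch below is not the authors' code: the function names, the bin width, and the simulated noise are assumptions, and the gradient components are drawn independently, as the white-noise model above supposes.

```python
import math
import random

def simulate_noise_gradient_magnitudes(varsigma, n, seed=0):
    """Draw n gradient magnitudes whose x and y components are independent
    zero-mean gaussians with standard deviation varsigma (assumed model)."""
    rng = random.Random(seed)
    return [math.hypot(rng.gauss(0.0, varsigma), rng.gauss(0.0, varsigma))
            for _ in range(n)]

def estimate_noise_from_peak(magnitudes, bin_width):
    """Histogram the magnitudes and return the centre of the fullest bin,
    which approximates the mode of the Rayleigh distribution."""
    counts = {}
    for m in magnitudes:
        b = int(m // bin_width)
        counts[b] = counts.get(b, 0) + 1
    peak_bin = max(counts, key=counts.get)
    return (peak_bin + 0.5) * bin_width

magnitudes = simulate_noise_gradient_magnitudes(7.0, 100000, seed=1)
print(estimate_noise_from_peak(magnitudes, 1.0))  # lands near 7 for this model
```

The estimate is only as fine as the bin width, and the histogram is flat near its peak, so smoothing the histogram before taking the maximum is a reasonable refinement.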

While this strategy works well for images containing
some uniform regions, it can fail when the image contains
many strong texture edges, as shown in Figure 3(a). In this
case, the many edges of strong gradient obliterate the
local maximum of $h(m)$ due to the noise distribution. It is
therefore preferable to fit the Rayleigh distribution to the
steep, rising portion of the histogram.

The fitting can be done easily by using the first derivative
of the smoothed histogram, $h'(m)$. Either the location
where $h'$ crosses zero or where it “would have crossed” zero,
whichever quantity is less, estimates $\varsigma$. The latter position
is computed by linearly extrapolating the downward sloping
portion of $h'$, as shown in Figure 3(c). The former case,
of course, corresponds to the first local maximum of $h(m)$,
which is used as the estimate for images satisfying the uniform
region assumption.

Once $\varsigma$ is estimated, one can choose a threshold to exclude
most edges due to noise. Removing noise with a
confidence of $c$ requires a threshold $\tau$ such that

$$c = \frac{\int_0^{\tau} z\, G(\varsigma; z)\, dz}{\int_0^{\infty} z\, G(\varsigma; z)\, dz}, \qquad (4)$$

yielding $\tau/\varsigma = \sqrt{-2 \ln(1 - c)}$. Hence, removing noise with a
confidence of 99% requires $\tau/\varsigma \approx 3$.
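
The step from equation (4) to the threshold ratio can be written out as follows, writing $\tau$ for the threshold and $\varsigma$ for the noise parameter. This is a reconstruction that assumes $c$ is the fraction of noise-only gradient magnitudes lying below $\tau$ and that $G(\varsigma; z)$ is the zero-mean gaussian of equation (1); its normalizing constant cancels in the ratio.

```latex
\begin{align*}
\int_0^{\tau} z\, e^{-z^2/2\varsigma^2}\, dz &= \varsigma^2 \left(1 - e^{-\tau^2/2\varsigma^2}\right), &
\int_0^{\infty} z\, e^{-z^2/2\varsigma^2}\, dz &= \varsigma^2, \\
c &= 1 - e^{-\tau^2/2\varsigma^2}
  \;\Longrightarrow\; \frac{\tau}{\varsigma} = \sqrt{-2 \ln(1 - c)}, \\
c &= 0.99
  \;\Longrightarrow\; \frac{\tau}{\varsigma} = \sqrt{2 \ln 100} \approx 3.03 .
\end{align*}
```

Under these assumptions the 99% case gives a ratio of about 3.03, in line with the value of 3 quoted in the text.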

Figures 3(f) and 2(c) demonstrate that this improved
method works for images consisting entirely of highly textured
regions, as well as for images containing untextured
areas. In these examples, we used the edge detector of
Canny [1983], thresholding the edges with hysteresis, using
$\tau_1/\varsigma = 2$ and $\tau_2 = 2\tau_1$. In many applications, it is desirable
to remove weak “real” edges as well, in which case higher
threshold factors are used. The examples also show that
simply choosing a fixed percentile of the histogram does
not provide a robust estimate of noise.


Figure 2: Simple Noise Estimation Example

(a) Image of mouse, containing smooth intensity surfaces. (b) Magnitude of gradient histogram with Rayleigh distribution fit
to peak at $m = \varsigma$. (c) Edges detected using hysteresis with $\tau_1 = 2\varsigma$, $\tau_2 = 4\varsigma$.




Figure 3: More Complicated Noise Estimation Example

Since this image (a) contains no smooth intensity surfaces, fitting the Rayleigh distribution to the global maximum of the
gradient histogram $h(m)$ (b) results in too high an edge threshold (e). A better fit is obtained (d) by extrapolating the first
derivative of the smoothed histogram (c), resulting in a better edge threshold (f). Edge images (e) and (f) were thresholded
with hysteresis with $\tau_1 = 2\varsigma$, $\tau_2 = 4\varsigma$.


2.2 Laplacian of Gaussian filtering and blob detection

For laplacian of gaussian edge or blob detection, one is
interested in choosing a non-zero threshold $\tau$ to exclude
most responses due to noise. If, as before, we assume the
smoothed image $f = I * G$ consists of white gaussian noise
with standard deviation $\sigma$, then the p.d.f. of $\nabla^2 f = f_{xx} + f_{yy}$
is $\Pr_{\nabla^2 f}(z) = G(\varsigma; z)$, where $\varsigma = 4\sigma$. In this case the
histogram of $\nabla^2 f$ is a gaussian distribution. Again, real edges
mainly affect the tails of the distribution, so the gaussian
distribution is fit to only the central portion of the histogram.

The relationship between threshold $\tau$ and confidence
level $c$ is given by the standard normal distribution. Avoiding
responses due to noise with $c = 99\%$, for example, requires
$\tau/\varsigma \approx 2.3$.

3 Blob detection

For blob detection it is natural to use the sign bits of
$I * \nabla^2 G$. In this sense blobs may be regarded as duals
of edges, which can be detected using the zero crossings of
$I * \nabla^2 G$. Instead of thresholding at zero, however, a small
threshold is used which removes some connections between
blobs due to noise. The threshold is positive for dark blobs
(and negative for light blobs), and its magnitude is proportional
to the noise estimate. In the examples here, we
identify dark blobs where $I * \nabla^2 G > \ell_0$, where $\ell_0 \approx \varsigma$.

Figure 4: Discounting Effects of Illumination

(a) Image $I$ of finger casting a shadow on burlap. Intensity values are approximately linear with brightness. (b) The shadow
reduces the density of blobs if $I * \nabla^2 G$ is used. (c) Logarithm of image, $\log I$. (d) Density of blobs remains the same if
$(\log I) * \nabla^2 G$ is used.

For the purpose of relating texture to the properties
of physical surfaces, it is desirable to discount the effects
of illumination. In the simplest model of image formation,
brightness is roughly the product of reflectance and
incident illumination, $b(x, y) = r(x, y)\, i(x, y)$ [Horn, 1986].
The presence of a shadow decreases the brightness difference
between a texton and its background and therefore
decreases the response of $I * \nabla^2 G$ over the shadowed region.
Hence, if pixel values are linear with brightness, shadows
can cause texture density or contrast boundaries, as
shown in Figure 4(a, b). This problem can be partially
avoided by using image values proportional to the logarithm
of brightness. Then differences in the image values
correspond to ratios of brightness values, which are
illumination-independent:

$$\log b_1 - \log b_2 = \log \frac{b_1}{b_2} = \log \frac{r_1}{r_2}, \qquad (5)$$

where $b_1$, $b_2$, $r_1$, and $r_2$ are the brightness and reflectance
values of a texton and its surround. As a result, shadows
do not affect the density and contrast of blobs, as shown
in Figure 4(c, d). Implementing such a scheme, of course,
requires that cameras be calibrated. Most cameras have
neither purely linear nor logarithmic response, suggesting
the use of a look-up table for transforming their output to
$I = \log b$.
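
The illumination argument can be checked numerically. The sketch below uses made-up reflectance and illumination values and hypothetical function names; it only illustrates that a difference of logarithms is unchanged when both brightness values are scaled by the same illumination factor.

```python
import math

def contrast_linear(b_texton, b_surround):
    """Brightness difference when pixel values are linear in brightness."""
    return b_surround - b_texton

def contrast_log(b_texton, b_surround):
    """Difference of log-brightness values, i.e. the log of their ratio."""
    return math.log(b_surround) - math.log(b_texton)

# Assumed example values: a dark texton (r = 0.2) on a light surround (r = 0.8),
# seen in full light (i = 1.0) and inside a shadow (i = 0.25).
r_texton, r_surround = 0.2, 0.8
for illumination in (1.0, 0.25):
    b1 = r_texton * illumination
    b2 = r_surround * illumination
    print(illumination, contrast_linear(b1, b2), contrast_log(b1, b2))
```

The linear contrast shrinks with the illumination, while the log contrast stays at $\log(r_2 / r_1)$ in both cases.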

3.1  Blob  segmentation 

Although some noise has been removed, one cannot simply
treat connected components of the thresholded $\nabla^2 I * G$ array
as textons, since these regions can span large distances
in the image. In order to compute meaningful attributes of
blobs, such regions must be segmented into localized blobs
and bars. Thin joins between otherwise compact bars must
be eliminated. Also, elongated regions must be separated
at bends and intersections into roughly linear bars. Simple
procedures for performing these two types of segmentation
are described next.

3.1.1  Compact  blob  segmentation 

Often, for a particular level, two nearby extrema of $I * \nabla^2 G$
may be joined into a single blob containing a thin neck.
These connections are sensitive to the level chosen and may
result in connected regions which span large portions of the
image. Such regions are segmented using multiple levels.
Whenever a higher level would result in a region being split
into two or more regions, we use the new regions provided
that they are relatively stable. The stability criterion (similar
to the stability criterion for fingerprints proposed by
Witkin [1983]) requires that the new regions exist over at
least as large a range of levels as the underlying one. The
splitting operation is applied recursively at several levels.
Finally, the new, smaller regions are numbered and then
grown back over the original regions in order to segment
them. The process is illustrated in Figure 5.
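
The effect of slicing the filtered surface at two levels can be sketched as below. This is a simplified illustration with hypothetical function names: it uses 4-connected labelling, a single pair of levels, and omits both the stability criterion and the final grow-back step described above.

```python
from collections import deque

def label_above(surface, level):
    """Return the 4-connected components (sets of (row, col)) where surface > level."""
    rows, cols = len(surface), len(surface[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if surface[r][c] > level and (r, c) not in seen:
                component = set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    component.add((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and (ny, nx) not in seen
                                and surface[ny][nx] > level):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                components.append(component)
    return components

def split_by_higher_level(surface, low, high):
    """For each region found at `low`, return its sub-regions at `high` when
    there are two or more of them, otherwise the region itself (high >= low)."""
    high_regions = label_above(surface, high)
    result = []
    for region in label_above(surface, low):
        children = [h for h in high_regions if h <= region]
        result.extend(children if len(children) >= 2 else [region])
    return result

# Two strong responses joined by a weak neck (assumed toy data).
surface = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 3, 3, 1, 3, 3, 0],
    [0, 3, 3, 1, 3, 3, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(len(label_above(surface, 0.5)))                 # one joined region
print(len(split_by_higher_level(surface, 0.5, 2.0)))  # split into two cores
```

In a fuller version the returned cores would be labelled and grown back over the low-level region so that every original pixel is assigned to one of them.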




3.1.2  Bar  segmentation 

Elongated regions can be segmented at junctions and bends
by using a refinement of the medial axis transform. The
medial axis, obtained by thinning the region down to its
skeleton, has been criticized for being sensitive to slight
perturbations in the region contour (see Figure 6). However,
a stable representation can be obtained by keeping
only those segments of the skeleton which represent elongated
sections of the region.

Figure 5: Segmenting compact bars using multiple levels

(a) The surface $I * \nabla^2 G$ sliced at levels $\ell_1$ and $\ell_2$. (b) The blob
at $\ell_1$ is split into two blobs at $\ell_2$. (c) The blobs at $\ell_2$ are labelled
and grown over the original blob to segment it.

Figure 6: A More Robust Medial Axis Transform

The medial axis transform of a rectangle (a) is undesirably sensitive
to slight changes in the bounding contour (b) [From Marr,
1982, p. 304]. Fitting straight line segments to the skeleton and
removing those which are short relative to their distance from
the perimeter yields a more robust representation (c).

The pruning is accomplished as follows: Straight line
segments are fit to the skeleton, with knot points placed
at any intersections as well. The average distance along
each corresponding skeleton segment to the perimeter of
the region is computed, and only segments whose lengths
are longer than $\alpha$ times their radii are retained (we use
$\alpha = 3$, which corresponds to a minimum aspect ratio of
$\alpha/2 = 3/2$). At this stage, a maximum aspect ratio can
be imposed on textons, if desired, by splitting very long
segments. The algorithm removes insignificant skeleton
segments, retaining only those which represent elongated,
bar-like parts of the region. The remaining segments, numbered,
are then grown back over the original region, segmenting
it into approximately linear segments. Compact
regions, which have no significant skeleton segments, are
not segmented. Figure 7 shows a simple example.
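
The retention rule for skeleton segments reduces to a one-line test per segment. In the sketch below each segment is represented, as an assumption of this example, by a `(length, mean_radius)` pair, where the radius is the average distance from the segment to the region perimeter; the function names are hypothetical.

```python
def aspect_ratio(length, mean_radius):
    """Length divided by width, taking the width as twice the mean radius."""
    return length / (2.0 * mean_radius)

def prune_segments(segments, alpha=3.0):
    """Keep only segments strictly longer than alpha times their mean radius."""
    return [(length, radius) for length, radius in segments
            if length > alpha * radius]

segments = [(10.0, 2.0), (4.0, 2.0), (6.0, 2.0)]
kept = prune_segments(segments)
print(kept)                                   # [(10.0, 2.0)]
print([aspect_ratio(l, r) for l, r in kept])  # [2.5]
```

With `alpha = 3`, every retained segment has an aspect ratio above 3/2, which matches the minimum aspect ratio stated in the text.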

The blob detection and segmentation operations are illustrated,
step-by-step, on a natural image in Figure 8.

3.2 Attribute computation and removal of spurious blobs

At this point the blob regions have been segmented into
localized blobs, whose attributes can be computed. The
length, width, and orientation of each blob are computed
from its second moments (see, for example, [Horn, 1986]).
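
A minimal sketch of the second-moment computation follows. The function name is hypothetical, and the conversion from moments to length and width is an assumption of this example: it reports the sides of the uniform rectangle that has the same second moments as the blob, which is one common convention and not necessarily the one the authors used.

```python
import math

def blob_attributes(pixels):
    """Return (length, width, orientation in radians) of a blob given as a
    list of (x, y) pixel coordinates, using central second moments."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    mxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    myy = sum((y - cy) ** 2 for _, y in pixels) / n
    mxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Orientation of the axis of least second moment.
    theta = 0.5 * math.atan2(2.0 * mxy, mxx - myy)
    # Eigenvalues of the 2x2 moment matrix: variances along the principal axes.
    spread = math.sqrt((mxx - myy) ** 2 + 4.0 * mxy ** 2)
    major = (mxx + myy + spread) / 2.0
    minor = max((mxx + myy - spread) / 2.0, 0.0)
    # Assumed convention: a uniform rectangle of side L has variance L^2 / 12.
    return math.sqrt(12.0 * major), math.sqrt(12.0 * minor), theta

bar = [(x, y) for x in range(10) for y in range(2)]  # a 10 x 2 horizontal bar
print(blob_attributes(bar))
```

For the horizontal bar the orientation comes out as zero and the length is much larger than the width; a bar along the diagonal would give an orientation of $\pi/4$.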

Figure 7: Segmenting Bars at Junctions and Intersections

(a) A bar-like figure to be segmented. (b) Skeleton (medial axis
transform) obtained by thinning. (c) Straight line skeleton with
segments having small aspect ratios removed. (d) Segmentation
obtained by growing labelled segments back over original figure.

Figure 8: Steps of blob detection and segmentation

(a) Image $I$, a close-up of mismatched clothing. (b) $I * \nabla^2 G$. (c) $I * \nabla^2 G > \ell_1$. (d) $I * \nabla^2 G > \ell_2 > \ell_1$. (e) Segmentation
obtained by growing (d) over (c). (f) Significant skeleton segments of (e). (g) Segmentation obtained by growing (f) over (e).
(h) Result of removing spurious blobs from (g). (i) Rectangles showing geometric attributes of blobs in (h).

In addition to geometric attributes, the average contrast
of each blob is computed along its perimeter using the quantity
$c(s) = \nabla(I * G) \cdot n(s)$, where $n(s)$ is the normal along
the blob's perimeter. Assuming a step edge of amplitude
$a$, the contrast at a point on the perimeter is $c(s) = a / (\sqrt{2\pi}\,\sigma)$,
where $G(\sigma)$ is the gaussian smoothing filter.
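
The step-edge contrast value can be derived in one dimension. The derivation below is a reconstruction under stated assumptions: the edge is an ideal step of amplitude $a$ located at $x = 0$, the normal is taken along $x$, and $H$ denotes the unit step function (a symbol introduced here for the example).

```latex
\begin{align*}
I(x) &= a\, H(x), \qquad
G(\sigma; x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}, \\
\frac{d}{dx}\,(I * G)(x) &= \left(\frac{dI}{dx} * G\right)(x)
  = a\, (\delta * G)(x) = a\, G(\sigma; x), \\
c &= \left.\frac{d}{dx}\,(I * G)\right|_{x = 0}
  = a\, G(\sigma; 0) = \frac{a}{\sqrt{2\pi}\,\sigma} .
\end{align*}
```

The measured contrast therefore scales with the edge amplitude and inversely with the width of the smoothing filter.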

Only blobs having over half their perimeter above the
noise threshold (computed as in Section 2.1) are maintained.
This removes remaining blobs due to noise as well
as any spurious blobs caused by strong edges along only
one side of them. Figures 9 and 10 show more examples.

3.3  Analysis  of  blob  detection  algorithm 

The proposed blob detection scheme is not claimed to be
biologically plausible. A more biologically plausible algorithm
is offered by Mahoney [1987], who uses elliptical
masks at different resolutions and orientations to detect
blobs in a binary array. However, the algorithm proposed
here employs local operations and is quite efficient. As currently
implemented on a Symbolics lisp machine, the entire
blob detection process takes about 6 minutes on the
160 x 454 pixel books image shown in Figure 10(a). Of this
time, about 2 minutes are used for convolution, differentiation,
and noise estimation; 3 minutes for blob segmentation
(using 5 levels); and 1 minute for attribute computation and
thresholding. On the Connection Machine, we expect the
algorithm to take a few seconds.



Figure 9: Blobs due to surface roughness

(a) An image of a pant leg and sock against a pitted ceiling tile. Color does not provide a cue for locating discontinuities in the
scene, since all surfaces are white. (b) All blobs found. (c) Strong blobs found at scale $\sigma = 2$ pixels.


4  Why  blobs? 

We conjecture that texture in natural scenes can be represented
using blobs (as detected by our algorithm) and their
attributes.

By using blobs instead of edges, our conjecture resembles
Julesz's; however, it is based on the analysis of natural,
rather than synthetic, images. We claim that blobs
are more useful than edges for representing texture for two
reasons. First, attributes of blobs capture more information
than attributes of edges. First order statistics of blob
attributes account for higher order statistics of edge attributes.
In particular, the average perpendicular distance
between edges, a second order measure used by some (e.g.,
Kjell and Dyer, 1985), is captured by the average width
of the blobs between them. Second, blobs provide a more
appropriate resolution for representing the physical events
which cause intensity changes. A blob can describe the approximate
location, size, and orientation of an impression
in a surface, even when its bounding contour (i.e., edges) is
not well defined.

Interestingly, in the images which we have examined,
dark blobs are useful more often than light ones as textons.
Small impressions in the surface, such as spaces between
threads in a fabric, or gaps between leaves on a tree, manifest
themselves as small dark regions in the image. Less
light is reflected from such surfaces because of shadows cast
by objects nearer the light source, or because of the oblique
angle of these surfaces with respect to the viewer. Even surface
markings, such as type on a page, are more often than
not dark on light, resulting in isolated dark blobs.

If desired, termination points can be marked at locations
which are the endpoints of single elongated blobs,
while junctions can be marked where three or four blobs
meet. However, it is not clear from natural images that
these features are as relevant to the interpretation of texture
as some psychophysical experiments might suggest. Even
some of the boundaries perceived in the synthetic images
of Julesz [1983] can be explained in terms of the basic blob
attributes mentioned above [Voorhees, 1987].

Figure 10: Blobs due to surface markings

(a) Close-up of books on a wooden table top. (b) Strong blobs
found at scale $\sigma = 1.5$ pixels.




5  Conclusion 


6  References 


We  have  shown  how  noise  can  be  robustly  estimated  in  nat¬ 
ural  images,  enabling  the  automatic  selection  of  thresholds 
for  edge  and  blob  detection.  The  method  works  for  im¬ 
ages  consisting  entirely  of  highly  textured  regions  as  well 
as  those  containing  roughly  uniform  regions. 

We  have  also  described  an  algorithm  for  detecting  in¬ 
tensity  blobs  in  images,  facilitating  the  construction  of  a 
raw  primal  sketch.  In  doing  so  we  have  answered  the  first 
question  regarding  the  texton  theory  of  vision:  How  can 
textons  extracted  from  images  of  natural  scenes? 

We  are  currently  studying  the  second  obvious  question: 
Once  textons  are  detected,  how,  exactly,  are  texture  bound¬ 
aries  identified?  Marr  [1976]  suggested  that  first  order 
statistics  of  attributes  over  regions  are  compared,  but  pro¬ 
vides  no  algorithm.  We  are  investigating  two  approaches. 

The  first  method  locates  texture  boundaries  by  using  a 
non-linear  smoothing  operator  to  compute  attributes  over 
neighborhoods  of  textons,  resulting  in  texton  attribute  ar¬ 
rays.  This  method,  which  is  somewhat  analogous  to  “preat- 
tentive”  visual  processing  (described  by  [Julesz,  1983]), 
identifies  gross  changes  in  attributes  which  give  rise  to 
clearly  perceived  texture  boundaries.  Interestingly,  these 
gross  texture  changes  correspond  to  the  type  caused  by 
physical  discontinuities.  We  are  currently  doing  experi¬ 
ments  to  determine,  for  each  attribute,  how  large  a  dif¬ 
ference  should  be  considered  significant. 

The  second  method  uses  statistical  tests  to  determine 
whether  attributes  of  textons  in  regions  on  either  side  of 
a  selected  boundary  are  from  different  populations.  This 
process,  which  may  be  analogous  to  “attentive”  processing, 
can  identify  more  subtle  differences  than  the  first  method, 
differences  which  require  scrutiny,  and  which  do  not  usually 
correspond  to  physical  discontinuities.  These  two  methods 
are  explained  in  detail  in  Voorhees  [1987]. 

It  should  be  noted  that  in  the  case  of  occlusion,  density 
of  textons  is  probably  the  most  useful  attribute.  Two  dif¬ 
ferent  surfaces  are  usually  at  different  depths,  and  hence  it 
is  unlikely  that  both  yield  textons  at  a  single  scale  of  anal¬ 
ysis.  Texture  boundaries  based  on  blob  attributes  are  more 
likely  to  occur  at  orientation  discontinuities  or  in  images 
where  the  depth  of  field  is  small. 

Future  research  will  address  the  third  question:  Once 
texture  boundaries  are  detected,  how  can  they  be  inter¬ 
preted  in  terms  of  their  physical  cause?  Riley  [1981]  of¬ 
fers  some  general  observations,  but  more  specific  rules  are 
needed.  Given  a  description  of  the  attribute  distributions 
on either side of a boundary, one would like to ascertain the
likelihood  that  the  boundary  is  due  to  occlusion,  a  change 
in  surface  orientation,  a  surface  marking,  or  a  shadow. 


Bracho,  R.  and  A.  C.  Sanderson.  “Segmentation  of  Im¬ 
ages  Based  on  Intensity  Gradient  Information,”  Proc. 
CVPR-85 ,  pp.  341-347,  1985. 

Canny, J. F. “Finding Edges and Lines in Images,” MIT
AI  Lab,  AI-TR-720,  June  1983. 

Horn, B. K. P. Robot Vision. Cambridge, MA: MIT Press, 1986.

Julesz,  B.  and  J.  R.  Bergen.  “Textons,  The  Fundamen¬ 
tal  Elements  in  Preattentive  Vision  and  Perception 
of  Textures,”  Bell  System  Technical  Journal ,  Vol.  62, 
No.  6,  July-Aug.  1983,  pp.  1619-1645. 

Kjell, B. P. and C. R. Dyer. “Edge Separation and Orientation
Texture Measures,” Proc. CVPR-85, pp. 306-311, 1985.

Mahoney,  J.  V.  “Image  Chunking:  Defining  Spatial  Build¬ 
ing  Blocks  for  Scene  Analysis,”  M.S.  Thesis,  M.I.T. 
Dept,  of  E.E.C.S.,  Jan.  1987. 

Marr,  D.  “Analyzing  Natural  Images:  A  Computational 
Theory  of  Texture  Vision,”  Cold  Spring  Harbor  Sym¬ 
posia  on  Quantitative  Biology,  Vol.  XL,  pp.  647-662, 
1976. 

Marr,  D.  Vision,  San  Francisco:  W.  H.  Freeman,  1982. 

Marr,  D.  and  E.  Hildreth,  “Theory  of  Edge  Detection,” 
Proc.  Royal  Society  of  London,  Vol.  B-207,  pp.  187- 
217,  1980. 

Riley,  M.  D.  “The  Representation  of  Image  Texture,”  MIT 
AI  Lab  AI-TR-649,  Sept.  1981. 

Voorhees, H. “Finding Texture Boundaries in Natural Images,”
M.S. Thesis, M.I.T. Dept. of E.E.C.S., Jan. 1987.

Witkin, A. “Scale-Space Filtering,” Proc. IJCAI-83, pp.
1019-1022.


RECONSTRUCTION OF SURFACES FROM PROFILES


Peter  Giblin  * 

Department  of  Mathematics 
University  of  Massachusetts,  Amherst,  MA  01003 


Richard Weiss †

Dept  of  Computer  and  Information  Science 
University  of  Massachusetts,  Amherst,  MA  01003 


ABSTRACT 

This paper presents an algorithm for computing a depth
map for a smooth surface from a sequence of profile curves.
The algorithm requires that the viewing directions be coplanar.
In addition, formulae are derived for computing the Gauss and
mean curvatures directly, without first computing a depth map.
We have used the algorithm to reconstruct curves from their
profiles with a high degree of accuracy from synthetic,
noise-free data.


1.  INTRODUCTION 

We can tell a lot about the shape of an object from
a single profile, and with multiple views we can often
determine the shape uniquely. In this paper we analyze this
reconstruction process mathematically and derive an algorithm
to produce a depth map. For some applications it may not be
necessary to produce a depth map at all (for example, in
recognition problems). The Gauss and mean curvatures may be
sufficient by themselves to solve this problem, and in any
case it will be useful to decompose surfaces into patches
according to whether they are convex, concave, hyperbolic,
parabolic, or planar (Besl and Jain [1], Brady et al. [2],
Ferrie and Levine [4]). This gives a method for describing
surfaces of objects and a basis for matching with standard
surfaces such as spheres, planes, and cylinders. Koenderink
and van Doorn [7] have derived a formula which could be used
to compute the Gauss curvature from a sequence of intensity
images without depth. We have


* The work was done while the author was on leave from the Dept. of
Pure Mathematics, University of Liverpool, L69 3BX, England, and was
a Visiting Professor of Mathematics at the University of Massachusetts.
He also received generous support from Five Colleges Inc.

† The work of this author was supported by the Air Force Office of
Scientific Research under grant F49620-83-C-0009, by the Defense
Advanced Research Projects Agency under contracts N00014-82-K-0464
and DACA78-85-C-0008, and by the National Science Foundation
under grant DCR-8318778.


extended  these  results  to  produce  formulae  for  the  mean 
curvature,  principal  curvatures,  and  principal  directions. 

The  use  of  profiles  for  the  recognition  and  description 
of surfaces has been explored by many people (Koenderink
and van Doorn [7], Callahan and Weiss [3], Hoffman and
Richards [5]), but most investigations have not combined
information from multiple views. In general, there is no way
to  identify  a  point  on  one  profile  with  a  corresponding  point 
on  a  profile  from  a  different  view  since  for  smooth  surfaces 
they  will  not  have  any  points  in  common.  In  fact,  most 
stereo  algorithms  which  are  based  on  correspondence  find 
the  most  similar  point  and  assume  it  is  the  same.  However, 
if  the  camera  motion  is  known,  then  there  is  a  method  to 
identify  points  on  two  different  profiles.  In  our  work,  we 
have  restricted  the  camera  to  planar  motion,  so  that  planes 
parallel  to  the  plane  of  motion  induce  a  correspondence 
between  the  profiles.  However,  it  is  possible  for  the  pro¬ 
file  to  change  qualitatively  between  views,  and  in  order  to 
understand  this,  we  have  analyzed  the  analogous  problem 
for  a  curve  in  the  plane.  These  transitions  create  ambigu¬ 
ities  in  the  reconstruction  process.  The  criterion  used  to 
resolve  this  ambiguity  is  that  the  most  likely  solution  is  the 
one  which  minimizes  the  change  in  depth  between  adjacent 
views. 

The  mathematical  approach  to  this  problem  is  that  a 
smooth  surface  without  inflection  points  is  the  envelope  of 
all  of  its  tangent  planes.  However,  there  are  two  problems 
with  this;  how  to  compute  the  envelope  of  a  family  of  planes 
and  how  to  handle  inflection  points.  With  the  assumption 
of  planar  camera  motion,  we  have  been  able  to  reduce  the 
envelope  of  planes  problem  to  that  of  computing  the  en¬ 
velope  of  a  family  of  lines  in  a  plane,  which  we  were  able 
to  solve.  The  problem  of  inflection  points  requires  interpo¬ 
lation  of  the  surface.  We  have  a  simple  approach  to  this 
which  would  work  in  most  cases,  and  we  plan  to  extend 
this to polyhedral surfaces where every point is either an
inflection  point  or  a  crease  point. 

In Section 2, we give the mathematical framework for
discussing profiles. Then for simplicity in Section 3, we
take the case of reconstructing a plane curve. In Section 5,
this is generalized to surfaces. In Section 7, we present the




experimental  results  using  simulated  data. 

2.  THE  PROFILES  OF  A  SURFACE 

Let u be a unit vector in 3-space R3 and let M be a
smooth  surface  in  R3.  We  regard  u  as  defining  a  viewing 
direction.  On  the  surface  M  there  is  a  locus  of  points  p 
for  which  the  tangent  plane  contains  the  viewing  direction. 
This  locus  of  points  is  called  the  critical  set  corresponding 
to  u.  If  this  curve  is  projected  parallel  to  u  onto  the  viewing 
plane  through  the  origin  which  is  perpendicular  to  u,  the 
image  is  called  the  profile  (Figure  1).  For  future  reference 
we  note  the  following  standard  facts  about  critical  sets  and 
profiles: 

•  The  critical  set  is  a  smooth  curve  at  p  unless  p  is  a 
parabolic  point  and  u  is  the  asymptotic  direction. 

•  The  profile  near  q  in  the  viewing  plane  arising  from  p 
on  the  critical  set  is  a  smooth  curve  unless  the  viewing 
direction  is  an  asymptotic  direction. 

•  When  the  critical  set  is  smooth  at  p,  its  tangent  di¬ 
rection  is  parallel  to  the  viewing  plane  if  and  only  if 
the  viewing  direction  is  a  principal  direction  at  p. 

• When the profile is smooth at q, its tangent at q is
always  in  the  tangent  plane  to  the  surface  at  p. 

The  Gauss  curvature  can  be  computed  from  the  profiles. 
Consider  a  point  p  of  the  critical  set.  The  viewing  direction 
at  p  together  with  the  normal  to  the  surface  there  determine 
a  plane,  whose  intersection  with  the  surface  is  a  smooth 
curve.  The  curvature  of  this  curve  at  p  on  the  critical  set 
is  called  the  radial  curvature.  The  projection  of  p  onto  q 
lies  on  the  profile,  and  the  curvature  of  the  profile  at  q  is 
called the transverse curvature. Koenderink and van Doorn
[7] showed that the Gauss curvature is the product of the
transverse curvature and the radial curvature. Brady et al.
[2] have also given an independent proof of this fact. The
curvature  of  the  profile  can  be  computed  directly  from  an 
image,  and  the  radial  curvature  could  be  computed  from  a 
sequence  of  images. 

3.  RECONSTRUCTING  PLANE  CURVES 

Consider a smooth plane curve $C$, which we usually take
to be closed, so that it is given by a parametrization
$\gamma(t) = (X(t), Y(t))$ such that the derivative is never
zero, i.e. $X'(t)$ and $Y'(t)$ are never zero for the same $t$.
As shown in Figure 2, $u = (\cos\theta, \sin\theta)$ is the
viewing direction. The viewing line is the line perpendicular
to $u$ through the origin. The profile in this context is the
set of points where the tangent lines to $C$ parallel to $u$
meet the viewing line. The position of a profile point can be
expressed as $w \cdot (\sin\theta, -\cos\theta)$, where $w$ is
the signed distance from the origin, $O$, to the profile point.
For example, in Figure 2, there are three points for which
$w > 0$ and one for which $w < 0$. Note that if $\theta$
produces a value $w$, then $\theta + \pi$ produces $-w$. Thus
we restrict to $0 \le \theta < \pi$. The collection of all
profile points forms a curve in the plane classically called
the pedal curve of $C$ with respect to $O$. If $\theta$ changes
in such a way that a pair of values of $w$ is annihilated and
disappears, or a pair is created, then the pedal curve has a
singular point (usually a cusp). This happens when $u$ passes
through the direction of an inflexional tangent to $C$, as
shown in Figure 3. In a neighborhood of that point, $w$ cannot
be a smooth function of $\theta$. In general, there will be
parts of $C$ for which $w$ is a function of $\theta$ for some
range of values of $\theta$. Suppose $C_1$ is such a subset of
$C$; then the following proposition states that $C_1$ can be
reconstructed from the function $w = w(\theta)$. Any part of
the curve which has no inflexions satisfies this property.
Another way to say this is that the Gauss map of $C_1$ has an
inverse, i.e. for each direction on the unit circle there is
at most one point of $C_1$ whose normal points in that
direction. Such curves, possibly with singularities, have been
studied by Langevin, Levitt, and Rosenberg [9] and are called
hérissons (hedgehogs).

Proposition 1

1. The curve $C_1$ consists of points

$$
\begin{aligned}
x &= w\sin\theta + w'\cos\theta \\
y &= -w\cos\theta + w'\sin\theta
\end{aligned}
$$

and $w' = dw/d\theta$ is the distance from the profile point
to the corresponding point of $C_1$.

2. The radius of curvature of $C_1$ at the point corresponding
to $\theta$ is $w + w''$. (Here $C_1$ is oriented by increasing
$\theta$.)

proof: The line through the profile point
$w \cdot (\sin\theta, -\cos\theta)$ parallel to the viewing
direction $u = (\cos\theta, \sin\theta)$ has the equation

$$((x, y) - (w\sin\theta, -w\cos\theta)) \cdot (\sin\theta, -\cos\theta) = 0$$

i.e.

$$x\sin\theta - y\cos\theta = w \tag{1}$$

The curve $C_1$ is the envelope of these lines, obtained by
eliminating $\theta$ between equation (1) and its derivative
with respect to $\theta$:

$$x\cos\theta + y\sin\theta = w' \tag{2}$$

This gives the unique solution $(x, y)$ in part 1 of the
Proposition. Rewriting this as

$$(x, y) = w(\sin\theta, -\cos\theta) + w'(\cos\theta, \sin\theta)$$

proves that $w'$ is the distance. The radius of curvature of a
curve is given by

$$\rho = \frac{(x'^2 + y'^2)^{3/2}}{x'y'' - x''y'}$$

Differentiating the equations in part 1 gives
$x' = (w + w'')\cos\theta$ and $y' = (w + w'')\sin\theta$;
substituting into the formula for $\rho$ gives part 2 of the
Proposition.
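As a rough numerical illustration of Proposition 1, the following Python sketch rebuilds points of a curve from its profile function and checks the radius-of-curvature formula. The function names, the finite-difference step `h`, and the test circle are assumptions made for this example; they are not taken from the paper.

```python
import math

def reconstruct_point(w, theta, h=1e-4):
    """Point of the curve seen at angle theta, from the profile function w.

    w is a callable theta -> signed profile distance. Its derivative is
    approximated by a central difference with step h.
    """
    dw = (w(theta + h) - w(theta - h)) / (2.0 * h)
    x = w(theta) * math.sin(theta) + dw * math.cos(theta)
    y = -w(theta) * math.cos(theta) + dw * math.sin(theta)
    return x, y

def radius_of_curvature(w, theta, h=1e-4):
    """Part 2 of the Proposition: rho = w + w'' (central second difference)."""
    d2w = (w(theta + h) - 2.0 * w(theta) + w(theta - h)) / (h * h)
    return w(theta) + d2w

# Test curve: a circle of radius R centred at (a, b). One of its two profile
# branches is w(theta) = a sin(theta) - b cos(theta) + R.
a, b, R = 0.5, -0.25, 2.0

def w_circle(theta):
    return a * math.sin(theta) - b * math.cos(theta) + R

points = [reconstruct_point(w_circle, 0.1 * k) for k in range(1, 31)]
max_error = max(abs(math.hypot(x - a, y - b) - R) for x, y in points)
rho = radius_of_curvature(w_circle, 0.7)
```

For this circle every reconstructed point should lie at distance `R` from the centre, and `rho` should be close to `R`, up to finite-difference error.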




It is not possible for $w + w''$ to be zero while $C_1$ is
smooth: zero radius of curvature is only possible at a cusp
of a curve. On the other hand, $w + w''$ can tend to infinity,
making the curvature tend to zero (which is an inflexion).
But at an inflexion $w$ is no longer a function of $\theta$,
so the inflexion is not in $C_1$.

Examples of these two cases are $w = \theta^3$ and
$w^2 = \theta^3$ respectively (see Figure 4). For the former,
$x = 3\theta^2 +$ higher order terms, and $y = 2\theta^3 +$
higher order terms. Thus $C_1$ has a cusp and the pedal curve
has an inflexion at $\theta = 0$. For the latter,
$x = \pm\frac{3}{2}\theta^{1/2} +$ higher order terms, and
$y = \pm\frac{1}{2}\theta^{3/2} +$ higher order terms. Thus
$C_1$ has an inflexion and the pedal curve has a cusp at
$\theta = 0$.

In practice we start with a curve $C$ and measure its
profile data, i.e. for each viewing direction
$u = (\cos\theta, \sin\theta)$ over some range of values of
$\theta$ we obtain the various values of $w$ for the profile
points. Some values of $\theta$ will give more values of $w$
than others, unless $C$ has no inflexions, in which case each
value of $\theta$ will have two values of $w$. Starting at
some value $\theta = \theta_0$, we choose one of the
corresponding values to be $w_0$ and increase $\theta$,
following $w$ as a function of $\theta$ until an inflexion is
encountered. This is detected by a change in the number of
$w$-values. In fact $\theta_0$ can be chosen so that it
immediately follows an inflexion. It turns out that one only
needs to consider the case when the number of $w$-values
decreases by two. The parts of the curve $C$ lying between
inflexions can be reconstructed using part 1 of the
Proposition. The computational problem to be solved is this:
having chosen $w_0$ for $\theta = \theta_0$, what value
corresponds to $\theta_0 + \delta\theta$? It is not sufficient
simply to choose the closest value of $w$. Since it is
possible for the pedal curve to have crossings (see Figure 5),
there is the danger of starting on one branch and switching to
the other. Note that we are approximating the function
$w = f(\theta)$ at discrete points, and we have assumed that
this function is continuous since $C$ is smooth. In addition,
since $w'$ is the distance from the viewing line to the point
of tangency on the curve, we also want $w'$ to be continuous.
This can be viewed as a constraint on choosing successive
values of $w$ such that the magnitude of $w''$ is minimized.
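The branch-following rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the viewing angles are taken to be equally spaced, the second derivative is replaced by a second difference, and the function names and the two synthetic crossing branches are invented for the example.

```python
def track_branch(candidates, start):
    """Follow one branch w(theta) through lists of candidate w-values.

    candidates[k] holds every w measured at the k-th (equally spaced)
    viewing angle; start is the value chosen at the first angle. Each step
    picks the candidate with the smallest second difference, i.e. the one
    closest to the linear prediction 2*w[k-1] - w[k-2].
    """
    path = [start]
    for k in range(1, len(candidates)):
        if k == 1:
            predicted = path[0]  # no slope estimate is available yet
        else:
            predicted = 2.0 * path[-1] - path[-2]
        path.append(min(candidates[k], key=lambda c: abs(c - predicted)))
    return path

def track_nearest(candidates, start):
    """Naive tracker that always takes the closest value of w."""
    path = [start]
    for k in range(1, len(candidates)):
        path.append(min(candidates[k], key=lambda c: abs(c - path[-1])))
    return path

# Two synthetic branches, w = theta and w = 1 - theta, which cross at
# theta = 0.5, between two of the sampled angles.
thetas = [0.05 + 0.1 * k for k in range(10)]
candidates = [[t, 1.0 - t] for t in thetas]
smooth_path = track_branch(candidates, start=thetas[0])
naive_path = track_nearest(candidates, start=thetas[0])
```

In this example `smooth_path` stays on the branch `w = theta` through the crossing, while `naive_path` jumps to the other branch at the first sample after the crossing.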

4. PERSPECTIVE PROJECTION OF CURVES

There  is  no  essential  difference  in  the  mathematics  of  re¬ 
constructing  curves  from  profiles  obtained  from  perspective 
projection.  We  derive  the  formulas  here  for  completeness. 
Assume  that  the  curve  C  is  contained  in  a  circle  of  radius 
r.  From  each  point  A  on  the  circle  we  define  the  viewing 
line  to  be  the  line  parallel  to  the  tangent  to  the  circle  at 
A  and  a  distance  d  away  from  it.  Each  tangent  line  to  C 
through A will intersect the viewing line at a profile point
(see Figure 6). The profile points are identified by their
distance $w$ in a counterclockwise direction from the origin of
the viewing line, which is the closest point to A. The values
of $w$ for $\theta$ and $\theta + \pi$ are no longer directly related.

Proposition 2 For regions of the curve in which $w$ is a
function of $\theta$,

$$x = \frac{r w d\sin\theta + r w^2\cos\theta + r w' d\cos\theta}{d^2 + w^2 + d w'}$$

$$y = \frac{-r w d\cos\theta + r w^2\sin\theta + r w' d\sin\theta}{d^2 + w^2 + d w'}$$

Thus  in  this  case  also  the  parts  of  C  between  inflexions  can 
be  recovered  from  a  knowledge  of  the  function  w.  However, 
in  this  case  the  formula  for  the  curvature  is  rather  more 
complicated. 

5.  SURFACES 

We  would  like  to  reconstruct  a  surface  from  its  profiles 
in  a  way  analogous  to  the  method  used  for  curves  in  the  pre¬ 
vious  section.  The  situation  for  surfaces  differs  in  two  re¬ 
spects:  each  profile  is  a  curve  and  there  is  a  two-parameter 
family  of  viewing  directions  which  as  unit  vectors  are  points 
on the sphere $S^2$. We seek to find all the tangent planes
to  M,  and  for  this  purpose,  it  is  sufficient  (and  necessary) 
to  know  the  profiles  for  a  “great  circle”  of  viewing  direc¬ 
tions,  i.e.  all  directions  which  lie  in  a  plane.  Consider  the 
tangent  plane  to  M  at  p.  The  unit  tangent  vectors  in  this 
plane form a great circle on $S^2$, and this circle intersects
the  great  circle  of  viewing  directions.  Hence  some  tangent 
line  to  M  at  p  is  in  the  viewing  direction  u,  so  that  p  con¬ 
tributes  a  point  q  to  the  profile  for  the  viewing  direction  u. 
Of  course,  for  opaque  surfaces  the  situation  is  complicated 
by  occlusion,  and  it  is  possible  that  for  some  points,  none 
of  the  singular  sets  would  be  visible. 

How  do  we  find  the  tangent  plane  to  M  at  p?  One  line 
in  that  plane  is,  of  course,  the  line  through  q  parallel  to 
u.  Another  line  is  the  tangent  line  to  the  profile  at  q  (see 
Figure  7).  If  the  profile  is  singular  at  q  then  the  tangent  line 
is  interpreted  as  a  limit  of  tangent  lines  at  nearby  smooth 
points of the profile. If the profile has a crossing at q, then
both branches will contribute a tangent plane to M but at
different points p of M ([7], [6], [3]).

We  shall  now  reconstruct  M  from  the  profiles  corre¬ 
sponding  to  a  circle  of  viewing  directions  (except  where  the 
tangent  plane  is  parallel  to  the  plane  of  the  circle  of  viewing 
directions).  The  calculation  which  follows  is  local  in  that 
it assumes a parametrization of profiles near each point.

Consider a family of viewing directions,
$u = (0, \cos\theta, \sin\theta)$, and corresponding viewing
planes $y\cos\theta + z\sin\theta = 0$. Each viewing plane will
contain a profile of M; we use orthonormal coordinates $(x, w)$
in the viewing plane, where the $w$-axis is in the direction
$(0, \sin\theta, -\cos\theta)$ and the $x$-axis is the same for
all of the planes. A point $(x, w)$ in a viewing plane then is
the point

$$x(1, 0, 0) + w(0, \sin\theta, -\cos\theta) = (x, w\sin\theta, -w\cos\theta) \tag{3}$$


in  (x,y,z)  space.  In  practice  it  is  the  numbers  (x,w)  which 
will  be  measured  from  an  image. 

In general, the profile will be a curve with isolated
cusps, and we present the theory for reconstructing the
surface from smooth points of profiles. The case of those
cusp points is still under investigation. In practice, this
should not be a problem since one can compute the surface
at points in a neighborhood of a cusp where the profile is
smooth. Assume that the profiles have the form
$w = w(x, \theta)$ over a range of values of $\theta$ and $x$,
where $w(x, \theta)$ is a smooth function. Thus we assume that
over some range of values of $x$ and $\theta$ the profiles are
smooth and not tangent to the $w$-axis. So starting with a
point on a given profile of M, we choose an axis of rotation
which is not parallel to the normal to the profile at that
point. It is intuitively reasonable that if the viewing
direction remains in the tangent plane, no additional
information will be obtained.

Now consider a fixed $x$ and $\theta$, i.e. a fixed point on a
particular profile curve. The tangent plane to M determined
by this $x$ and $\theta$ passes through
$(x, w\sin\theta, -w\cos\theta)$ and contains the directions:

$(0, \cos\theta, \sin\theta)$ (the viewing direction)

$(1, w_x\sin\theta, -w_x\cos\theta)$ (tangent to the profile)

where $w_x = \partial w/\partial x$. The equation of the plane
is therefore

$$w_x X - \sin\theta\, Y + \cos\theta\, Z = x w_x - w \tag{4}$$

where we temporarily use $(X, Y, Z)$ as current coordinates
in R3 to avoid the double use of $x$.
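The tangent-plane construction can be checked numerically with a short Python sketch. The function name and the test sphere are assumptions chosen for illustration; the plane coefficients follow from taking the cross product of the two directions listed above.

```python
import math

def tangent_plane(x, theta, w, w_x):
    """Coefficients (A, B, C, D) of the tangent plane A*X + B*Y + C*Z = D.

    The normal is the cross product of the viewing direction
    (0, cos t, sin t) and the profile tangent (1, w_x sin t, -w_x cos t),
    taken up to sign.
    """
    s, c = math.sin(theta), math.cos(theta)
    return (w_x, -s, c, x * w_x - w)

# Check on a sphere of radius R centred at the origin: every profile is the
# circle x^2 + w^2 = R^2, so w = sqrt(R^2 - x^2) and w_x = -x / w.
R, x, theta = 2.0, 0.6, 0.9
w = math.sqrt(R * R - x * x)
w_x = -x / w
A, B, C, D = tangent_plane(x, theta, w, w_x)

# For this sphere the critical-set point coincides with the profile point.
p = (x, w * math.sin(theta), -w * math.cos(theta))
residual = A * p[0] + B * p[1] + C * p[2] - D
# For a sphere the plane normal must be parallel to the position vector p.
cross = (B * p[2] - C * p[1], C * p[0] - A * p[2], A * p[1] - B * p[0])
```

Both `residual` and the components of `cross` should vanish up to rounding, which confirms that the plane touches the sphere at `p`.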

Remark: If the profiles are parametrized as $x = x(t, \theta)$,
$w = w(t, \theta)$ for some parameter $t$, then the tangent plane
is

$$w_t X - x_t\sin\theta\, Y + x_t\cos\theta\, Z = x w_t - x_t w \tag{5}$$

The surface M is the envelope of all the tangent planes
described by (4); that is, we obtain the point $(X, Y, Z)$ of
M corresponding to $(x, \theta)$ by eliminating $x$ and $\theta$
between (4) and its derivatives with respect to $x$ and
$\theta$, viz.

$$\partial/\partial x: \quad w_{xx} X = x w_{xx} \tag{6}$$

$$\partial/\partial\theta: \quad w_{x\theta} X - \cos\theta\, Y - \sin\theta\, Z = x w_{x\theta} - w_\theta \tag{7}$$

This amounts to finding the intersection of three tangent
planes given by $(x, \theta)$, $(x + \delta x, \theta)$, and
$(x, \theta + \delta\theta)$. The intersection of these three
planes will approach the point on M in the limit as $\delta x$
and $\delta\theta$ go to zero. This runs into problems when one
or the other direction produces a stationary tangent plane.
According to equation (6), $X = x$ (the line through a profile
point in a viewing direction is always in a plane $X$ =
constant), but (6) also indicates that if the profile has an
inflexion ($w_{xx} = 0$), then there will be problems
distinguishing one tangent plane to M from the “next”. This is
similar to the case for curves in which the envelope of tangent
lines to a plane curve with an inflexion contains the whole
inflexional tangent line. In fact (4), (6), and (7) determine
a unique point $(X, Y, Z)$ if and only if $w_{xx} \ne 0$,
namely the point

$$f(x, \theta) = (x,\; w\sin\theta + w_\theta\cos\theta,\; -w\cos\theta + w_\theta\sin\theta) \in M \tag{8}$$

Note that $w_\theta$ is the distance from a profile point to the
corresponding point on M.
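The reconstruction formula can be tried out on synthetic data with a short Python sketch. The function name, the finite-difference step `h`, and the off-centre test sphere are assumptions made for this illustration and are not part of the paper's experiments.

```python
import math

def reconstruct_surface_point(w, x, theta, h=1e-5):
    """Surface point from the profile function w(x, theta).

    Returns (x, w sin t + w_t cos t, -w cos t + w_t sin t), where the
    derivative w_t of w with respect to theta is approximated by a
    central difference with step h.
    """
    w_theta = (w(x, theta + h) - w(x, theta - h)) / (2.0 * h)
    s, c = math.sin(theta), math.cos(theta)
    value = w(x, theta)
    return (x, value * s + w_theta * c, -value * c + w_theta * s)

# Test surface: a sphere of radius R centred at (a, b, c0). Its profile in
# the view theta is a circle, whose upper half is
# w = b sin(theta) - c0 cos(theta) + sqrt(R^2 - (x - a)^2).
a, b, c0, R = 0.3, -0.4, 0.5, 2.0

def w_sphere(x, theta):
    return (b * math.sin(theta) - c0 * math.cos(theta)
            + math.sqrt(R * R - (x - a) ** 2))

samples = [(a - 1.5 + 0.25 * i, 0.2 * j) for i in range(13) for j in range(16)]
points = [reconstruct_surface_point(w_sphere, x, t) for x, t in samples]
max_error = max(abs(math.dist(p, (a, b, c0)) - R) for p in points)
```

Every reconstructed point should lie on the sphere, so `max_error` measures only the finite-difference and rounding error.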

Remark: In the general case where $w$ is not always a function
of $x$, we can parametrize the family of profile curves as
follows:

$$x = x(t, \theta), \qquad w = w(t, \theta)$$

In this case, except where $x_t$ is zero, equation (8) becomes

$$f(t, \theta) = (x, w\sin\theta, -w\cos\theta) + \frac{-x_\theta w_t + x_t w_\theta}{x_t}\,(0, \cos\theta, \sin\theta) \tag{9}$$

and $(-x_\theta w_t + x_t w_\theta)/x_t$ is the distance from a
profile point to a point on M.

The formula (8) tells us how to reconstruct M from its
profile data, as long as $w$ can be expressed as a function
of $x$ and $\theta$. The condition for $f$ to give a smooth
piece of surface (i.e. the condition that the differential of
$f$ has rank 2) is $w + w_{\theta\theta} \ne 0$; note that this
also arose in the case of curves above.

On M at any point $f(x, \theta)$, we have coordinate
directions corresponding to:

$$f_x = \partial f/\partial x = (1,\; w_x\sin\theta + w_{x\theta}\cos\theta,\; -w_x\cos\theta + w_{x\theta}\sin\theta)$$

$$f_\theta = \partial f/\partial\theta = (w + w_{\theta\theta})\,(0, \cos\theta, \sin\theta)$$

where $f_x$ is in the direction of the critical set and
$f_\theta$ is in the viewing direction, so they are not in
general orthogonal. Nevertheless, they are conjugate with
respect to the second fundamental form (see Figure 8). Note
that $f_x$ and $f_\theta$ will not coincide at a smooth point
of the profile. Geometrically, this means that with respect to
the ellipse determined by this quadratic form, each direction
is tangent to the ellipse at the point of intersection by the
axis determined by the other. Algebraically, the matrix
associated with the second fundamental form is diagonal
with respect to the basis of these two directions and (using
$n = (-w_x, \sin\theta, -\cos\theta)/(1 + w_x^2)^{1/2}$ as the
unit normal to M) can be written as:

$$\frac{1}{(1 + w_x^2)^{1/2}} \begin{pmatrix} w_{xx} & 0 \\ 0 & -(w + w_{\theta\theta}) \end{pmatrix}$$

Thus, it is possible to derive simple formulae for the Gaussian
and mean curvatures of M in terms of the profile data $w$.
As noted above, the consequent formula for the Gaussian
curvature as the product of the radial curvature,
$\kappa_\theta$, and the transverse curvature, $\kappa_c$, was
known, but the mean curvature cannot be expressed in terms of
only these two.
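As a sanity check on the curvature expressions stated in Proposition 3, the following Python sketch evaluates them for two surfaces whose curvatures are known in closed form. The function name and the test surfaces are assumptions made for illustration, and the sign of the mean curvature depends on the choice of unit normal.

```python
import math

def surface_curvatures(w, w_x, w_xx, w_xt, w_tt):
    """Gauss and mean curvature from profile data at one point.

    Arguments are w and its partial derivatives (t stands for theta):
    w_x, w_xx, the mixed derivative w_xt, and w_tt. Requires w + w_tt != 0.
    """
    a = w + w_tt
    g = 1.0 + w_x * w_x
    K = -w_xx / (g * g * a)
    H = (w_xx * a - g - w_xt * w_xt) / (2.0 * g ** 1.5 * a)
    return K, H

# Sphere of radius R centred at (0, b, c0):
# w = b sin t - c0 cos t + sqrt(R^2 - x^2).
R, b, c0, x, t = 2.0, 0.3, -0.2, 0.7, 1.1
root = math.sqrt(R * R - x * x)
w = b * math.sin(t) - c0 * math.cos(t) + root
K_sphere, H_sphere = surface_curvatures(
    w, w_x=-x / root, w_xx=-R * R / root ** 3, w_xt=0.0,
    w_tt=-b * math.sin(t) + c0 * math.cos(t))

# Cylinder of radius R about the x-axis: w = R for every x and theta.
K_cyl, H_cyl = surface_curvatures(R, 0.0, 0.0, 0.0, 0.0)
```

With these inputs the sphere gives a Gauss curvature of `1/R**2` and a mean curvature of magnitude `1/R`, while the cylinder gives zero Gauss curvature and a mean curvature of magnitude `1/(2*R)`.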




6. RELATIONSHIP BETWEEN SECTIONAL CURVATURES AND THE SURFACE
CURVATURES

There are three curvatures which enter into the equations
for the surface curvatures.

$\kappa_c$ is the curvature of the profile, or the transverse
curvature. Its formula is

$$\kappa_c = w_{xx}/(1 + w_x^2)^{3/2}$$

$\kappa_x$ is the sectional curvature of M in the $f_x$
direction (the direction of the tangent to the critical set).
We find (see below) that

$$\kappa_x = w_{xx}/[(1 + w_x^2)^{1/2}(1 + w_x^2 + w_{x\theta}^2)]$$

$\kappa_\theta$ is the sectional curvature of M in the
$f_\theta$ direction, which is the viewing direction $u$ when
$w + w_{\theta\theta} > 0$ and $-u$ when
$w + w_{\theta\theta} < 0$. This is also known as the radial
curvature. We find (see below) that

$$\kappa_\theta = -1/[(1 + w_x^2)^{1/2}(w + w_{\theta\theta})]$$

Proposition 3

1. The Gauss curvature $K$ and the mean curvature $H$ of M at
$f(x, \theta)$ are given by

$$K = -w_{xx}/[(1 + w_x^2)^2 (w + w_{\theta\theta})] = \kappa_c \kappa_\theta$$

$$H = \frac{w_{xx}(w + w_{\theta\theta}) - 1 - w_x^2 - w_{x\theta}^2}{2(1 + w_x^2)^{3/2}(w + w_{\theta\theta})}$$

2. $w_{x\theta} = 0$ if and only if the viewing direction is a
principal direction on M at the corresponding point. In this
case $f_x$ and $f_\theta$ are the principal directions,
$\kappa_c = \kappa_x$, and $H = \frac{1}{2}(\kappa_c + \kappa_\theta)$.

proof: The computation of the Gauss and mean curvatures
from a local parametrization $f$ of M can be found in O’Neill
[8]. We can obtain $f_x$ and $f_\theta$ from equation (8) and
compute the first fundamental form as

$$\begin{pmatrix} 1 + w_x^2 + w_{x\theta}^2 & w_{x\theta}(w + w_{\theta\theta}) \\ w_{x\theta}(w + w_{\theta\theta}) & (w + w_{\theta\theta})^2 \end{pmatrix}$$

This together with the second fundamental form given above
allows us to compute the surface curvatures. Thus $K$ and
$H$ can be expressed in terms of the curvatures $\kappa_c$,
$\kappa_x$, $\kappa_\theta$.

The shape operator, which is the derivative of the Gauss
mapping, referred to the basis $f_x$, $f_\theta$ is:

$$\frac{1}{(1 + w_x^2)^{3/2}(w + w_{\theta\theta})} \begin{pmatrix} w_{xx}(w + w_{\theta\theta}) & -w_{xx} w_{x\theta} \\ w_{x\theta}(w + w_{\theta\theta}) & -(1 + w_x^2 + w_{x\theta}^2) \end{pmatrix}$$

The principal directions are the eigenvectors of this matrix.
One can see that, given the assumption that
$w + w_{\theta\theta}$ is nonzero, the matrix is diagonal if
and only if $w_{x\theta} = 0$. The principal curvatures are the
eigenvalues, and the Gauss curvature is the product of the
eigenvalues and the mean curvature is half their sum. Thus,
when the matrix is diagonal, $f_x$ and $f_\theta$ are the
principal directions and they are orthogonal.

When the profile has a cusp, the curvatures $K$ and $H$
will be the limits of the values of the formulae at
corresponding smooth points near it. However, $K$ for example
cannot be expressed as $\kappa_c \kappa_\theta$, since
$\kappa_c$ is infinite and $\kappa_\theta$ is zero. Finding a
formula to replace this is a goal of current investigation.

7. EXPERIMENTAL RESULTS

In order to determine the potential accuracy of the algorithm
for reconstructing curves, several experiments were
performed. Synthetic data was used to generate profile
points for the various values of $\theta$. Figure 9 shows a
picture of a curve with two inflexions and the pieces of the
curve which are reconstructed from the data.

In principle, one could start with the profile data and
trace the curve from start to finish, reversing direction at
inflexion points, always choosing the value of $w$ which
minimizes $w''$. However, there is no sure test on the values
for $w$ which will guarantee that a point is an inflexion
point. It is possible to find all of the tangent directions
for the inflexion points, which can be called flex directions.
These are the directions at which the number of tangent lines
changes. We use these flex directions to break up the profile
data into hérisson segments for which $w$ is a function of the
viewing direction. The current algorithm constructs arcs
corresponding to the hérisson segments between inflexion
points.

As a practical point, one can choose the place to start
drawing the curve to be any $w$ value after a flex direction.
Once the pieces of the curve between the inflexions are
drawn, they can be linked together with straight lines
based on proximity of endpoints (assuming one has started
with a closed curve). In theory one would like to minimize
the integral of the absolute value of $w''$. In the current
algorithm we only apply this criterion locally to choose the
values of $w$ sequentially. It may be necessary to apply
standard relaxation techniques when dealing with real data.

The algorithm can be extended easily to surfaces, and
we intend to apply this to the problem of model acquisition
from physical prototypes.
8. CONCLUSION

We  have  given  a  procedure  for  reconstructing  surfaces 
from  a  sequence  of  profiles.  In  addition  we  have  given  a 
formula  for  mean  curvature  in  terms  of  the  profile  data 
which  extends  the  similar  result  for  the  Gauss  curvature. 
This  makes  it  possible  to  compute  both  of  these  curvatures 
without  first  computing  a  dense  depth  map.  We  have  used 
this  algorithm  on  synthetic,  noise-free  data  to  reconstruct 




curves  with  a  high  degree  of  accuracy,  and  we  plan  to  test 
it  on  real  data. 

ACKNOWLEDGEMENTS 
We  would  like  to  thank  James  Callahan  for  his  many 
helpful  conversations  and  suggestions. 

REFERENCES 

[1] Besl, P.J. and Jain, R.C., “Segmentation Through Symbolic
Surface Description”, Proceedings CVPR86, Miami, June 1986,
pp. 77-85.

[2] Brady, M., Ponce, J., Yuille, A., and Asada, H., “Describing
Surfaces”, Proceedings of 2nd International Symposium on
Robotics Research, Hanafusa and Inoue (Eds.), MIT Press,
Cambridge, MA.

[3] Callahan, J. and Weiss, R., “A Model for Describing
Surface Shape”, Proc. IEEE Conf. Computer Vision
and Pattern Recognition, San Francisco, June 1985,
pp. 240-245.

[4] Ferrie, F.P., and Levine, M.D., “Piecing Together the
3D Shape of Moving Objects: An Overview”, Proc.
IEEE Conf. Computer Vision and Pattern Recognition,
San Francisco, June 1985, pp. 574-584.

[5] Hoffman, D.D. and Richards, W., “Parts of Recognition”,
MIT AI Memo 732, Dec. 1983.

[6] Kergosien, Y.L., “La famille des projections orthogonales
d’une surface et ses singularités”, C.R. Acad. Sc. Paris,
292, pp. 929-932, 1981.

[7] Koenderink, Jan J., “What Does the Occluding Contour
Tell Us About Solid Shape?”, Perception, vol. 13,
1984, pp. 321-330.

[8] O’Neill, B., Elementary Differential Geometry, Academic
Press, New York, 1966.

[9] Langevin, R., Levitt, G., and Rosenberg, H., “Hérissons
et Multihérissons”, preprint.


Figure  1:  The  critical  set  Em  and  profile  of  the  projection  of  M 


Figure 2: The profile of a plane curve is a set of points on the viewing line

Figure 3: An inflexion in C produces a cusp in the pedal curve

Figure 4: Examples of pedal curves for a cusp and an inflexion

Figure 5: A point where the pedal curve crosses itself

Figure 6: The viewing line for perspective projection

Figure 7: The tangent plane at p contains the viewing direction
and a vector parallel to the tangent to the profile curve

Figure 8: The viewing direction and the tangent to the critical set
are conjugate

Figure 9: a. A drawing of the limacon from initial data. b. The
reconstruction of the limacon.


USING  OCCLUDING  CONTOURS  FOR  OBJECT  RECOGNITION 


Willie  Lim 

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 
Cambridge,  MA  02139 


Abstract 

This  paper  describes  an  algorithm,  in  the  area  of  sym¬ 
bolic  high  level  vision,  that  is  implemented  on  the  Con¬ 
nection  Machine.  The  domain  chosen  is  the  rocks  world 
where  objects  are  rocks  or  mountains.  These  objects  are 
represented  using  qualitative  surface  models.  The  algo¬ 
rithm  takes  a  library  of  such  models  and  loads  them  all  into 
the  Connection  Machine.  Candidate  images  are  processed 
to  extract  occluding  contours  which  are  then  matched  in 
parallel  to  all  the  library  models.  The  algorithm  produces 
all  the  views  of  the  models  that  match  the  occluding  con¬ 
tours. 

1  Introduction 

In  the  rocks  world,  objects  are  hard  to  describe  using  quan¬ 
titative  surface  models.  For  this  reason,  a  qualitative  model 
is  used.  In  this  model  an  object  is  partitioned  into  sur¬ 
face  patches.  Each  surface  patch  is  qualitatively  typed  ac¬ 
cording  to  its  surface  shape  e.g.  flat  or  doubly  convex.  A 
graph  is  constructed  with  the  nodes  representing  the  sur¬ 
face patches and arcs representing the neighborhood relationships
between the patches. The arrangement of surface
patches varies from object to object in an unpredictable way.
Thus  it  is  very  unlikely  that  two  rocks  will  have  an  identical 
arrangement  of  surface  patches.  This  is  unlike  man-made 
objects  where  objects  can  be  made  to  have  identical  shapes. 
It  is  the  unpredictable  nature  of  such  surface  properties  in 
the  rocks  world  that  enables  the  physical  constraints  im¬ 
posed  by  these  qualitatively  described  surface  patches  to 
be  effective  for  shape  recognition. 

This  paper  describes  how  a  convex  object  in  the  rocks 
world  can  be  recognized  using  its  occluding  contour  derived 
from an image of the object. The emphasis here is on illustrating
how the Connection Machine [3] can be used for
object recognition, and in particular for shape recognition
in the rocks world. A survey of Connection Machine algorithms
for other applications is given in [4]. The algorithm
presented in this paper is implemented in *Lisp [1].


The  paper  is  organized  as  follows.  The  various  terms 
used  in  the  paper  are  defined  in  Section  2.  Several  con¬ 
straints  are  used  in  the  algorithm.  They  are  discussed  in 
Section  3.  Section  4  describes  the  algorithm  and  the  con¬ 
clusion  of  the  paper  is  given  in  Section  5. 


2  Occluding  Contours  of  Rocks 

The  surface  of  a  rock  is  partitioned  into  surface  patches, 
each  of  which  is  assigned  a  surface  type  that  qualitatively 
describes the overall shape of the surface patch. For example,
a surface patch that appears flat will be typed as
flat. Such characterizations of surface patches are relative
to the neighboring surface patches. Hence it is possible for
a  curved  surface  patch  on  one  rock  to  be  classified  as  flat 
on  another  rock.  The  matching  algorithm  presented  in  this 
paper  exploits  the  constraints  imposed  on  each  other  by 
these  surface  patches. 

Using  the  surface  patches,  a  rock  is  modeled  as  a  graph 
termed  a  model  graph.  Each  node  in  the  model  graph  cor¬ 
responds  to  a  surface  patch  and  each  arc  represents  the 
adjacency  (i.e.  neighborhood)  relationships  of  the  surface 
patches.  The  recognition  task  starts  with  an  occluding  con¬ 
tour  i.e.  a  closed  curve  that  marks  the  discontinuity  in 
depth  between  the  object  and  the  background  [6].  The 
occluding  contour  is  partitioned  into  segments  which  are 
assigned  types  called  contour  types  e.g.  straight  or  (dou¬ 
bly)  curved.  For  exposition  in  this  paper,  the  number 
of  surface  types  is  limited  to  two — flat  or  curved.  Simi¬ 
larly  the  number  of  contour  types  is  limited  to  two.  In 
practice  the  number  of  surface  types  can  be  increased  to 
six — doubly  curved  convex,  doubly  curved  concave,  sad¬ 
dle  shaped,  singly  curved  convex,  singly  curved  concave, 
and  flat.  The  number  of  contour  types  can  be  increased  to 
three — convex,  concave  and  straight. 
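
The representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (`ModelGraph`, `add_patch`, `add_arc`) and the three-patch fragment are hypothetical, and the original implementation was written in *Lisp, not Python.

```python
from dataclasses import dataclass, field

# The two-type world used for exposition; the six surface types and three
# contour types mentioned in the text would simply extend these sets.
SURFACE_TYPES = {"flat", "curved"}
CONTOUR_TYPES = {"straight", "curved"}


@dataclass
class ModelGraph:
    """Qualitative model of one rock: typed patches plus adjacency arcs."""
    surface_type: dict = field(default_factory=dict)  # node -> surface type
    neighbors: dict = field(default_factory=dict)     # node -> set of nodes

    def add_patch(self, node, surface_type):
        if surface_type not in SURFACE_TYPES:
            raise ValueError(f"unknown surface type: {surface_type}")
        self.surface_type[node] = surface_type
        self.neighbors.setdefault(node, set())

    def add_arc(self, u, v):
        # Adjacency is symmetric, so the arc is stored in both directions.
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)


# A hypothetical three-patch fragment, for illustration only.
g = ModelGraph()
for node, kind in [("p", "flat"), ("q", "curved"), ("r", "flat")]:
    g.add_patch(node, kind)
g.add_arc("p", "q")
g.add_arc("q", "r")

# A contour string is the sequence of contour types met along the contour.
contour_string = ("curved", "straight", "straight", "straight")
```

Storing neighbors as sets keeps the adjacency test used by the matcher a constant-time membership check.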

The sequence of symbols representing the contour
types of an occluding contour traversed in a given direction
is termed a contour string. For a surface patch to match a
symbol of the contour string, the surface patch must have a
surface type that is consistent with the contour type represented
by the symbol. Thus if the segment of the occluding
contour is straight, it can only match a flat surface patch in
the world where surface types are restricted to be doubly
curved or flat. This has to be true as each occluding contour
segment is due to a surface patch that grazes the line
of sight. Thus if the occluding contour segment is curved,
the surface must be curved. Furthermore, surface patches
matching adjacent symbols in the contour string must also
be adjacent in the model graph.

Since an occluding contour is a closed curve, the string
of nodes that matches the contour string must form a cycle.
Such a string of nodes is termed a node string. It partitions
the model graph into two halves which correspond to the
two complementary “front” and “back” views of the object
that produce the same occluding contour. To distinguish
between the two views, further information about the internal
structure of the object has to be derived from the
image. Such information might be the number and types
of the surface patches inside the occluding contour.

Throughout this paper, the example shown in Figure
1 is used for illustrating how the algorithm works. Figure
1(a) shows the model graph for a convex object with six
faces, three of which are curved. An example of such an
object is a rectangular block (e.g. a generalized cone [2]
with a square cross-section and a straight spine) with two
of its ends (surface normals parallel to the axis or direction
of sweep) and one side (surface normal perpendicular
to the axis) curved instead of flat. The curved side is
also adjacent to the bottom (and top) side of the object.
Suppose that a view of the object produces the occluding
contour shown in Figure 1(b). One of the contour strings
representing the contour would be the string of contour
types (curved, straight, straight, straight). Other possibilities
are rotations and reversals of this string. The occluding
contour corresponds to viewing the object from
a point along its axis, at a sufficient distance away from
the object. The matching nodes (and arcs) are shown in
dark lines in Figure 1(c). Another occluding contour is
given in Figure 1(d). The contour string for this view is
(straight, curved, curved, curved). It corresponds to viewing
the object from the top or bottom. Figure 1(e) shows
the matching nodes (also in dark lines).

Figure 1: An example. (a) Model graph. (b) Occluding contour
for end view. (c) Matching nodes for end view. (d) Occluding
contour for top view. (e) Matching nodes for top view.


3 The Constraints


The  algorithm  generates  all  possible  matches  for  a  given 
contour  string  i.e.  all  possible  node  strings  (and  hence 
views).  As  the  algorithm  progresses  the  number  of  pos¬ 
sibilities  is  reduced  by  applying  five  constraints. 

The  first  of  these  constraints  is  the  type  consistency 
constraint :  the  contour  type  of  the  contour  segment  being 
matched  must  be  consistent  with  the  surface  type  of  the 
surface patch. This means that a straight contour segment can
only  match  a  flat  surface  and  a  curved  segment  a  curved 
surface. 

The second constraint is the adjacency constraint: if
a node matches position i, where 0 ≤ i ≤ L - 1 and
L is the length of the contour string, it must be adjacent
to nodes whose types are consistent with the contour types
at positions (i - 1) mod L and (i + 1) mod L. In other
words the adjacency constraint involves the application of
the type consistency constraint to the node and two of its
neighbors given that the node is at position i. Thus if the
node x matches position i and its neighboring nodes, w
and y, match positions (i - 1) mod L and (i + 1) mod L,
respectively, then wxy is a possible substring of the node
string. This constraint and the type consistency constraint
are applied at the initialization phase of the algorithm.

It  is  assumed  that  surface  patches  are  not  so  large  as 
to cover the object from one side to the other. This means
that a third constraint can be used. This is the uniqueness
constraint:  nodes  in  the  node  string  must  be  unique.  This 
means  that  as  the  node  string  is  being  built,  the  nodes 
being  added  must  be  new  ones. 

Since  an  occluding  contour  is  a  closed  curve,  there  is 
a  fourth  constraint  called  the  wrap  around  constraint:  the 
first  and  last  nodes  in  the  node  string  must  be  adjacent. 
This  constraint  is  used  at  the  end  of  the  algorithm  for  elim¬ 
inating  illegal  node  strings. 

The  fifth  constraint  is  the  two-neighbor  constraint:  if 
the  length  of  the  contour  string  is  greater  than  3,  each 
node  in  the  node  string  cannot  be  adjacent  to  more  than 
two  other  nodes  in  the  string.  Applying  this  constraint  is 
easy.  Just  make  sure  that  the  node  at  position  i  does  not 
have  more  than  two  neighbors  in  the  node  string.  This 
constraint  limits  the  number  of  views  that  can  be  han¬ 
dled  by  the  matcher.  For  example  if  four  surface  patches 
meet  at  a  corner  such  that  when  the  object  is  viewed  from 
directly above the corner, they produce an occluding contour,
all the nodes in the corresponding node string (with
L = 4) will be adjacent to the rest of the nodes in the
string.  If  it  is  known  that  there  is  a  corner  inside  the  oc¬ 
cluding  contour,  this  constraint  can  be  ignored.  With  this 
in  mind,  a  more  comprehensive  version  of  the  constraint 
is  being  developed  to  overcome  this  limitation.  Informa¬ 
tion  about  the  internal  structure  of  the  rock  will  have  to 
be  used.  For  example  if  the  number  and  types  of  internal 
surface  patches  are  known,  the  type  consistency,  adjacency, 
and  uniqueness  constraints  can  be  applied  to  the  nodes  rep¬ 
resenting  these  surface  patches.  Quantitative  information 
relevant  to  relative  surface  areas  and  orientations  can  also 
be  used  to  further  reduce  the  number  of  possibilities.  Such 
information  will  be  useful,  for  example,  for  inferring  when 
surface  patches  occlude  one  another  and  hence  cannot  be 
in  the  same  node  string.  Information  about  surface  areas 
will  also  be  useful  in  determining  when  the  uniqueness  con¬ 
straint  can  be  applied. 
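
As a check on how the five constraints interact, the following Python sketch enumerates node strings sequentially for the six-faced example object. It is a brute-force reference under stated assumptions, not the parallel algorithm of the paper: the surface types and adjacency are inferred from Tables 1 and 2 (faces a-d, b-e and c-f are taken to be opposite and therefore not adjacent), and the names `match`, `SURFACE`, `ADJ` and `COMPATIBLE` are hypothetical.

```python
from itertools import permutations

# Surface types and adjacency inferred from Tables 1 and 2 (an assumption,
# not read from Figure 1(a)): opposite faces are the only non-adjacent pairs.
SURFACE = {"a": "flat", "b": "curved", "c": "curved",
           "d": "flat", "e": "flat", "f": "curved"}
OPPOSITE = {"a": "d", "d": "a", "b": "e", "e": "b", "c": "f", "f": "c"}
ADJ = {n: {m for m in SURFACE if m != n and m != OPPOSITE[n]} for n in SURFACE}

# Contour type -> surface type it can arise from (two-type world).
COMPATIBLE = {"straight": "flat", "curved": "curved"}


def match(contour):
    """Return every node string that satisfies all five constraints."""
    L = len(contour)
    found = []
    # permutations() only yields strings of distinct nodes: uniqueness.
    for s in permutations(SURFACE, L):
        # Type consistency at every position.
        if any(SURFACE[s[i]] != COMPATIBLE[contour[i]] for i in range(L)):
            continue
        # Adjacency of consecutive nodes; i = L - 1 is the wrap-around check.
        if any(s[(i + 1) % L] not in ADJ[s[i]] for i in range(L)):
            continue
        # Two-neighbor constraint, applied only when L > 3.
        if L > 3 and any(len(ADJ[n] & set(s)) > 2 for n in s):
            continue
        found.append("".join(s))
    return sorted(found)


end_view = match(("curved", "straight", "straight", "straight"))
top_view = match(("straight", "curved", "curved", "curved"))
```

Under these assumptions the end view yields the two node strings baed and bdea, and the top view yields ecbf and efbc, in agreement with the worked example in Section 4.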

4 Matching a Model to an Occluding Contour

Model graphs are stored on the Connection Machine by
assigning a node to each processor. Nodes within a model
graph are uniquely numbered. Each processor contains a
table mapping node numbers to processor addresses. The
list of neighboring nodes and their surface types is also
stored in each processor as a list of pairs of node number
and surface type.
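
A sequential stand-in for this storage layout might look as follows. The record type `ProcessorState`, the helper `load_model`, the base address 100 and the three-node model are all hypothetical; the sketch only mirrors the three pieces of per-processor data named in the text.

```python
from dataclasses import dataclass


@dataclass
class ProcessorState:
    node: int            # node number, unique within its model graph
    surface_type: str    # surface type of the patch held by this processor
    neighbors: list      # [(neighbor node number, surface type), ...]
    address_of: dict     # node number -> processor address


def load_model(surface_types, arcs, base_address):
    """Assign one simulated processor per node, starting at base_address."""
    address_of = {n: base_address + k
                  for k, n in enumerate(sorted(surface_types))}
    adjacency = {n: [] for n in surface_types}
    for u, v in arcs:
        # Each arc is recorded at both endpoints, with the neighbor's type.
        adjacency[u].append((v, surface_types[v]))
        adjacency[v].append((u, surface_types[u]))
    return {address_of[n]: ProcessorState(n, surface_types[n],
                                          adjacency[n], address_of)
            for n in surface_types}


# Hypothetical three-node model placed at processor addresses 100..102.
processors = load_model({0: "flat", 1: "curved", 2: "flat"},
                        [(0, 1), (1, 2)], 100)
```

Keeping the neighbors' surface types locally means the type consistency and adjacency tests at initialization need no communication.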

The  algorithm  starts  by  broadcasting  the  contour 
string  to  all  processors  in  the  Connection  Machine.  Each 
processor  then  goes  through  an  initialization  step  which 
involves  finding  the  positions  in  the  contour  string  that 
the  processor  can  map  to.  The  corresponding  initial  sub¬ 
strings  involving  the  node  and  two  of  its  neighbors  are  also 
computed.  Then  the  adjacency  and  uniqueness  constraints 
are applied as shown below. Note that the constraints are applied
by calling functions that return booleans, and concatenations
are denoted by the symbol o. After initialization,
every processor will have a list of pairs of values, i.e.
(position, substring). The substring component of these
pairs is updated until the complete node strings are obtained.

foreach processor p
  foreach position i in the contour string
    if type-consistency-constraint(contour-type(i), surface-type(p))
    then
      forall neighbors x, y of p
        if adjacency-constraint(contour-type((i - 1) mod L),
                                surface-type(x),
                                contour-type((i + 1) mod L),
                                surface-type(y))
           and uniqueness-constraint(x, p, y)
        then collect (i, x o p o y)
             in list-of-legal-substrings-for-node-string
        endif
      endforall
    endif
  endforeach
endforeach
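
The initialization step above can be run sequentially in Python, with a loop over nodes standing in for the processors. The model graph is the six-faced example object, with surface types and adjacency inferred from Tables 1 and 2 (an assumption); the function names `consistent` and `initialize` are hypothetical.

```python
from itertools import permutations

# Example object; types and adjacency inferred from Tables 1 and 2.
SURFACE = {"a": "flat", "b": "curved", "c": "curved",
           "d": "flat", "e": "flat", "f": "curved"}
OPPOSITE = {"a": "d", "d": "a", "b": "e", "e": "b", "c": "f", "f": "c"}
ADJ = {n: {m for m in SURFACE if m != n and m != OPPOSITE[n]} for n in SURFACE}
COMPATIBLE = {"straight": "flat", "curved": "curved"}


def consistent(contour_type, surface_type):
    """Type consistency constraint for one contour segment and one patch."""
    return COMPATIBLE[contour_type] == surface_type


def initialize(contour):
    """Return {node: [(position, left, node, right), ...]} of legal triples."""
    L = len(contour)
    table = {}
    for p in sorted(SURFACE):                  # "foreach processor p"
        entries = []
        for i in range(L):                     # "foreach position i"
            if not consistent(contour[i], SURFACE[p]):
                continue
            # Ordered pairs of distinct neighbors: x != y, and p is never
            # its own neighbor, so x, p, y are unique.
            for x, y in permutations(sorted(ADJ[p]), 2):
                if (consistent(contour[(i - 1) % L], SURFACE[x])
                        and consistent(contour[(i + 1) % L], SURFACE[y])):
                    entries.append((i, x, p, y))
        table[p] = entries
    return table


table = initialize(("curved", "straight", "straight", "straight"))
# Left/right pairs for node e at position 2 and for node b at position 0.
e_at_2 = [(x, y) for i, x, _, y in table["e"] if i == 2]
b_at_0 = [(x, y) for i, x, _, y in table["b"] if i == 0]
a_positions = sorted({i for i, _, _, _ in table["a"]})
```

For the end-view contour string this reproduces entries of Table 1, for instance the pairs (a,d) and (d,a) for node e at position 2.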

In the rest of the matching algorithm the substrings
are lengthened by “distance-doubling” or “pointer-jumping”
[5,7]. That is, the substrings are lengthened by
concatenating longer and longer substrings which are obtained
by communicating with processors that are further
and further away. The length of these substrings and the
distance between processors double at each iteration. The
substring  currently  being  computed  in  a  processor  is  di¬ 
vided  into  three  parts — left,  center  and  right.  In  the  rest 
of  this  paper,  this  triple  of  substrings  is  represented  as 
(left, center, right). 

Figure 2 illustrates the case for 17 linearly linked nodes
(i.e. each node is adjacent to its immediate neighbors).
Consider the node i in the middle of the column of nodes.
At initialization, its left, center and right substrings are
h, i and j, respectively. Similarly nodes h and j have the
substrings (g,h,i) and (i,j,k), respectively. The substrings
are marked next to dark dots. The lines mark the leftmost
and rightmost nodes of the left and right substrings, respectively,
at the end of each iteration. Dot 0 represents the
state of the substrings for node i just after initialization.
At the first iteration, each node sends to the rightmost
(leftmost) node of its right (left) substring its current left
(right) substring. That is, node h sends its left substring
g and node j its right substring k to node i. Node i updates
its left and right substrings to gh and jk, respectively,
by concatenating the substring it receives from the right
(left) node to its current right (left) substring. This is indicated
in Figure 2 by dot 1 (to the right of dot 0). At
the same time, node g has the substrings (ef,g,hi) (the
dark dot vertically above dot 1) by receiving the substrings
e and i. Similarly, node k has the substrings
(ij,k,lm) (the dark dot vertically below dot 1) using the
substrings i and m sent to it. At the second iteration,
node i receives information from nodes g and k resulting in
the new substrings (efgh,i,jklm) (see dot 2). Meanwhile
nodes e and m update their substrings to (abcd,e,fghi) and
(ijkl,m,nopq), respectively. At the third iteration, nodes
e and m communicate with node i causing it to update its
substrings to (abcdefgh,i,jklmnopq) (see dot 3). It can be
seen that the lengths of the left and right substrings double
at each iteration. The process terminates after O(log L)
steps.
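
The doubling behaviour for the 17-node chain can be simulated sequentially in Python. This is a sketch of the linear case only: the dictionaries `left` and `right` and the list `history` are hypothetical names, and the synchronous update is emulated by building new dictionaries from the previous iteration's values.

```python
import string

nodes = string.ascii_lowercase[:17]      # 'a' .. 'q', the 17 chained nodes

# Initialization: each node knows only its immediate neighbors.
left = {n: nodes[k - 1] if k > 0 else "" for k, n in enumerate(nodes)}
right = {n: nodes[k + 1] if k < len(nodes) - 1 else ""
         for k, n in enumerate(nodes)}

history = [(left["i"], right["i"])]      # substrings of node i at dot 0
for _ in range(3):
    # All nodes update together from the previous iteration's values.
    new_left, new_right = {}, {}
    for n in nodes:
        far_left = left[n][:1]           # leftmost node of the left substring
        far_right = right[n][-1:]        # rightmost node of the right substring
        new_left[n] = (left[far_left] if far_left else "") + left[n]
        new_right[n] = right[n] + (right[far_right] if far_right else "")
    left, right = new_left, new_right
    history.append((left["i"], right["i"]))
```

After three iterations node i holds the left substring abcdefgh and the right substring jklmnopq, matching dot 3 in the description above.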



For  the  general  case,  each  node  can  exchange  infor¬ 
mation  with  more  than  a  pair  of  nodes.  This  is  because  a 
node  can  be  on  more  than  one  chain  of  nodes.  The  exact 
number  of  nodes  involved  can  be  as  high  as  the  degree  of 
the  node  in  the  model  graph.  Additional  information  has 
to  be  exchanged  to  indicate  which  left  or  right  substrings 
should  be  updated  since  a  node  can  exchange  information 
with  another  node  multiple  times.  Moreover,  the  unique¬ 
ness  and  two-neighbor  constraints  are  continuously  applied 
to the new left, center and right substrings. Substrings that
do not satisfy these constraints are rejected.

As an illustration of the more general case, consider
the example shown in Figure 1(b). The initial left and right
substrings obtained for the various matching positions are
shown in Table 1 (and Table 2 for Figure 1(d)). The center
substring (not shown in the third column of the table) is
the node itself. Node a (same for node d) matches positions
1 and 3. For position 1, one of the initial substrings
is (b,a,e).


Nodes   Position   Substrings (left, right)
a, d    1          (b,e), (c,e), (f,e)
        3          (e,b), (e,c), (e,f)
b       0          (a,d), (d,a)
c, f    0          (a,d), (a,e), (d,a), (d,e), (e,a), (e,d)
e       1          (c,a), (c,d), (f,a), (f,d)
        2          (a,d), (d,a)
        3          (a,c), (a,f), (d,c), (d,f)

Table 1: Initial positions and substrings for an end view.


Consider what happens in node b at the first iteration.
It exchanges information with nodes a and d. From node
a it is supposed to insert the left substring e with a at
position 3 (i.e. with e to be placed at position 2) and the
right substring e at position 2 with a at position 1. This
produces the substrings (ea,b,d) and (d,b,ae) with node b
at position 0. With the information obtained from node d,
the result is the same. Note that redundant substrings are
removed. Since L = 4, the iteration stops. It turns out
that the node strings generated, bdea and baed, match
the contour string. These are also the node strings for
nodes a, d and e.

A more interesting example is node c (or f). This
node communicates with three other nodes: a, d, and e.
Using information from a, it gets the substrings (d,c,ae),
(ea,c,d), (ea,c,e), (e,c,ae). The first two substrings are
eliminated by the two-neighbor constraint since the nodes
a, d, and e are all neighbors of c. The last two substrings
are eliminated by the uniqueness constraint. Similarly the
substrings (a,c,de), (ed,c,a), (ed,c,e), (e,c,de) obtained
through communication with node d are eliminated. Communication
with node e produces the substrings (a,c,ea),
(d,c,ea), (ae,c,a), (ae,c,d), (a,c,ed), (d,c,ed), (de,c,a),
and (de,c,d), all of which are eliminated by the uniqueness
and two-neighbor constraints. Thus node c (and f) cannot
be in the node string.


Nodes   Position   Substrings (left, right)
a, d    0          (b,c), (b,f), (c,b), (c,f), (f,b), (f,c)
b       1          (a,c), (a,f), (d,c), (d,f)
        2          (c,f), (f,c)
        3          (c,a), (c,d), (f,a), (f,d)
c, f    1          (a,b), (d,b), (e,b)
        3          (b,a), (b,d), (b,e)
e       0          (c,f), (f,c)

Table 2: Initial positions and substrings for a top view.

For the top view example, the node strings are ecbf
and efbc. The nodes which computed these strings are
nodes b, c, e, and f. Nodes a and d produce no node
strings.

After the node strings are found, they are sent to the
host, making sure that only one processor per model
graph communicates with the host. Each processor that
has  a  node  string  checks  to  see  if  its  own  node  number 
is  the  smallest  in  the  node  string.  If  so  then  it  sends  its 
node  strings  to  a  designated  processor  in  its  group  (i.e.  the 
processors  that  implement  the  model  graph).  This  proces¬ 
sor  might  be  designated  by  simply  having  the  lowest  node 
number  or  processor  address  among  the  group  of  nodes  or 
processors  implementing  the  model  graph.  After  collect¬ 
ing  all  these  node  strings,  the  designated  processor  sets  a 
flag  to  indicate  that  it  has  at  least  one  node  string.  This 
flag  is  then  used  to  select  the  processors  for  accessing  their 
node  strings.  Knowing  the  selected  processors  and  their 
node  strings,  the  host  can  determine  which  model  and  view 
match  the  contour  string. 
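
The reporting rule can be illustrated with a short Python sketch. It assumes, for illustration only, that node letters stand in for node numbers and that every node on a matched cycle ends up holding the same node strings; the function name `reporting_nodes` is hypothetical.

```python
def reporting_nodes(strings_by_node):
    """Keep, per node, only the node strings in which it is the smallest node."""
    reports = {}
    for node, strings in strings_by_node.items():
        mine = [s for s in strings if min(s) == node]
        if mine:
            reports[node] = mine
    return reports


# End-view example: nodes a, b, d and e all hold the same two node strings,
# while nodes c and f hold none.
held = {n: ["baed", "bdea"] for n in "abde"}
held.update({"c": [], "f": []})
reports = reporting_nodes(held)
```

Only node a forwards its node strings here, so each match reaches the designated processor once rather than four times.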

In  estimating  the  time  complexity  of  the  algorithm,  it 
is useful to note that the time it takes to exchange messages
is the dominant cost. If one assumes that this is true
and that at any iteration, information can be exchanged
with at most N nodes where N is the maximum number
of nodes in a model graph, the algorithm takes O(N log L)
information exchanges.

5  Conclusion 

This  paper  shows  how  a  parallel  matcher  for  shape  recogni¬ 
tion  in  the  rocks  world  can  be  implemented  on  the  Connec¬ 
tion  Machine.  The  matcher  uses  constraints  derived  from 
qualitative models. For any given occluding contour the
time it takes to find one match is the same as that it
takes to find any number (including zero) of matches. This
means that it is relatively easy to determine if a new model
needs to be added to the data base. The set of matches
computed forms an equivalence class which can be used as a
data base (a much smaller one) for a more elaborate and accurate
matcher that uses more quantitative models. Without
reducing the size of the data base such a matcher might
turn out to be too expensive computationally to use.


6  Acknowledgment 


The author would like to thank Rod Brooks for his comments
on this work.


References 

[1] The Essential *Lisp Manual. Thinking Machines Corporation,
Cambridge, Massachusetts, 1986. Thinking
Machines Technical Report 86.15.

[2] Dana H. Ballard and Christopher M. Brown. Computer
Vision. Prentice-Hall, Inc., Englewood Cliffs, New Jersey,
1982.

[3] W. Daniel Hillis. The Connection Machine. The MIT
Press, Cambridge, Massachusetts, 1985.

[4] W. Daniel Hillis and Guy L. Steele, Jr. Data parallel
algorithms. Communications of the ACM, 29(12):1170-1183,
December 1986.

[5] Willie Y-P. Lim. Fast Algorithms for Labeling Connected
Components in 2-D Arrays. Technical Report,
Thinking Machines Corporation, Cambridge, Massachusetts,
November 1986. To be published.

[6] David Marr. Vision. W. H. Freeman and Company,
San Francisco, California, 1982.

[7] J. C. Wyllie. The Complexity of Parallel Computations.
Technical Report TR 79-387, Department of Computer
Science, Cornell University, Ithaca, New York, August
1979.


PARALLEL  OPTICAL  FLOW  COMPUTATION 
James Little, Heinrich Bülthoff and Tomaso Poggio


Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 


Abstract 

We outline a new, parallel and fast algorithm for computing
the optical flow. The algorithm is suggested by a
regularization method that we call the “constraint method”.
The method, based on a theorem of Tikhonov, enforces
local constraints and leads to efficient, parallel algorithms.
The specific constraint exploited by our algorithm
can be shown to correspond, in its most general
form, to 3-D rigid motion of planar surfaces. An initial
segmentation of the motion field can be obtained from
the optical flow field generated by the algorithm. We also
suggest an iterative scheme that can provide fast, approximate
solutions and refine them subsequently. We
discuss the implementation of the algorithm on the Connection
Machine™ system and its near real-time performance
on synthetic and natural images.


1.  Introduction 

The  computation  of  motion  is  an  important  module  of 
early  vision.  It  is  potentially  useful  for  computing  the 
3-D  structure  of  surfaces,  for  segmenting  a  scene  into 
objects  and  as  a  control  module  for  navigation.  The 
3-D  motion  field,  however,  or  even  its  2-D  projection, 
cannot  be  directly  measured  from  the  time  sequence  of 
images.  Much  effort  in  recent  years  has  been  invested 
in  analyzing  ways  to  compute  the  optical  flow  -  the  2-D 
field  of  motion  of  brightness  in  the  image.  A  common 
assumption  is  that  the  optical  flow,  suitably  defined,  is 
a  very  close  approximation  to  the  2-D  projection  of  the 
true  3-D  motion  field.  Verri  and  Poggio  (these  Proceed¬ 
ings)  argue  against  this  assumption  and  the  related  use 
of  the  optical  flow  to  obtain  a  precise  quantitative  esti¬ 
mate  of  3-D  structure  and  motion  under  general  condi¬ 
tions.  They  argue  also,  however,  that  qualitative  prop¬ 
erties  of  the  optical  flow  are  very  useful  for  segmenting 
the scene into different objects and for providing information
about the type of motion and the 3-D structure,
and that they are more robust than quantitative estimates.
Notice that a closed-loop control system does
not  need  accurate  estimates.  Furthermore,  Verri’s  anal¬ 
ysis  suggests  that  the  precise  definition  of  optical  flow  is 
not  critical  as  long  as  it  preserves  the  qualitative  prop¬ 
erties  of  the  2-D  motion  field. 

In  this  paper  we  outline  a  new,  robust  and  fast  al¬ 
gorithm  for  computing  a  version  of  the  optical  flow  that 
was  suggested  by  a  regularization  method  that  we  call 
“constraint  method”.  The  algorithm  has  been  imple¬ 
mented  on  the  Connection  Machine™  system.  Com¬ 
paratively  little  work  in  the  area  of  motion  computation 
has made use of sequences of real images, because of
technical  limitations  in  acquiring  and  processing  them. 
Our  algorithm,  which  runs  in  times  that  are  typically 
of  the  order  of  a  few  seconds,  is  routinely  used  with  im¬ 
age  sequences  grabbed  by  the  Vision  machine’s  eye-head 
system  (Poggio  and  staff,  these  Proceedings). 

2.  The  algorithm 

We  can  formulate  a  first  algorithm  in  terms  of  the  sim¬ 
plest  possible  assumption,  that  the  optical  flow  is  lo¬ 
cally  uniform.  This  assumption  is  strictly  only  true  for 
translational  motion  of  3-D  planar  surface  patches  par¬ 
allel  to  the  image  plane.  It  is  a  restrictive  assumption 
that, however, may be a satisfactory local approximation
in many cases. Let $E_t(x,y)$ and $E_{t+\Delta t}(x,y)$ represent
transformations of two discrete images separated
by time interval $\Delta t$, such as filtered images or a map
of the intensity changes in the two images (more generally,
they can be maps containing a feature vector at
each location $x, y$ in the image). We look for a motion
displacement (also discrete) $v = (v_x, v_y)$ at each discrete
location $x, y$ such that

$$\| E_t(x,y) - E_{t+\Delta t}(x + v_x \Delta t,\, y + v_y \Delta t) \|_{\mathrm{patch}_i} = \min \qquad (1)$$

where the norm is over a local neighborhood centered
at each location $x, y$ and $v(x, y)$ is assumed constant in
the neighborhood. Equation (1) implies that we should
look at each $x, y$ for $v = (v_x, v_y)$ such that

$$\int_{\mathrm{patch}_i} E_t(x,y)\, E_{t+\Delta t}(x + v_x \Delta t,\, y + v_y \Delta t)\, dx\, dy \qquad (2)$$

is maximized. Equation (2) represents the correlation
between a patch in the first image centered around the
location $x, y$ and a patch in the second image centered
around the location $x + v_x \Delta t,\ y + v_y \Delta t$.
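
The step from equation (1) to equation (2) can be made explicit if the norm is taken to be the L2 norm over the patch. That choice is an assumption here, since the text does not state which norm is meant. Writing $E'$ for the displaced second image:

```latex
\begin{aligned}
\| E_t - E' \|^2_{\mathrm{patch}_i}
  &= \int_{\mathrm{patch}_i} \bigl( E_t - E' \bigr)^2 \, dx\, dy \\
  &= \int_{\mathrm{patch}_i} E_t^2 \, dx\, dy
   + \int_{\mathrm{patch}_i} E'^2 \, dx\, dy
   - 2 \int_{\mathrm{patch}_i} E_t \, E' \, dx\, dy ,
\end{aligned}
\qquad
E'(x,y) = E_{t+\Delta t}(x + v_x \Delta t,\; y + v_y \Delta t).
```

The first term does not depend on $v$. If the second term varies little over the allowed displacements (a further assumption), minimizing the norm in equation (1) amounts to maximizing the cross term, which is the correlation in equation (2).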

This algorithm translates easily into the following
description. Consider a network of processors
representing the result of the integrand in equation (2).
Assume for simplicity that this result is either 0 or 1.
(This is the case if E_t and E_{t+Δt} are binary feature
maps.) The processors hold the result of multiplying
(or logically “anding”) the two image maps for
different values of x, y and v_x, v_y. The next stage,
corresponding exactly to the integral operation over the
patch, is for each processor to count how many
processors are active in an x, y neighborhood at the same
displacement. Each processor thus collects a vote
indicating support that a patch of surface exists at that
displacement. The last stage is to choose v(x, y) out of
a finite set of allowed values that maximizes the
integral. This is done by an operation of “non-maximum
suppression” across velocities out of the finite allowed
set: at the given x, y, the processor is found that has
the maximum vote. The corresponding v(x, y) is the
velocity of the surface patch found by the algorithm.
This algorithm is similar to the stereo algorithm
implemented by M. Drumheller and T. Poggio (1986) on the
Connection Machine™ system.
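
The three stages just described (comparison, voting, non-maximum suppression) can be sketched serially as follows. This is a minimal illustration under stated assumptions, not the Connection Machine implementation: the function names `box_count` and `vote_flow`, the square voting window of radius `r`, and the rule that the first velocity in the list wins a tie are all choices made for the example.

```python
def box_count(m, r):
    """Number of active cells in the (2r+1) x (2r+1) window around each cell."""
    h, w = len(m), len(m[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                m[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


def vote_flow(e0, e1, velocities, r=2):
    """Return, per pixel, (winning velocity, votes); (None, 0) if nothing matched."""
    h, w = len(e0), len(e0[0])
    best = [[(None, 0) for _ in range(w)] for _ in range(h)]
    for vx, vy in velocities:
        # Comparison stage: AND the feature at (x, y) in the first map with
        # the feature at (x + vx, y + vy) in the second map.
        match = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                x2, y2 = x + vx, y + vy
                if 0 <= x2 < w and 0 <= y2 < h:
                    match[y][x] = e0[y][x] & e1[y2][x2]
        # Voting stage: count matches in the neighborhood at this displacement.
        votes = box_count(match, r)
        # Non-maximum suppression across velocities (assumption: on a tie the
        # velocity listed first is kept).
        for y in range(h):
            for x in range(w):
                if votes[y][x] > best[y][x][1]:
                    best[y][x] = ((vx, vy), votes[y][x])
    return best


# Usage: a sparse binary pattern displaced by one pixel in x.
features = [(1, 1), (2, 4), (3, 2), (4, 5), (5, 3)]  # (row, column)
frame0 = [[0] * 8 for _ in range(8)]
frame1 = [[0] * 8 for _ in range(8)]
for row, col in features:
    frame0[row][col] = 1
    frame1[row][col + 1] = 1
flow = vote_flow(frame0, frame1, [(-1, 0), (0, 0), (1, 0)], r=2)
print(flow[3][3])  # ((1, 0), 5)
```

Each velocity is handled independently, which is what makes the scheme easy to run in parallel over all x, y.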

The  algorithm  will  not  find,  in  general,  the  2-D  pro¬ 
jection  of  the  true  3-D  velocity  field.  This  will  happen 
only  when  the  features  used  for  matching  correspond  to 
markings  on  the  3-D  surfaces  and  when  either  the  fea¬ 
tures  are  sparse  (no  ambiguity)  or  the  disambiguation 
step  (the  voting  and  non-maximum  suppression  stage) 
finds  the  true  correspondence  (i.e.  the  underlying  as¬ 
sumptions  are  satisfied).  Even  when  the  result  is  not 
the  true  motion  field,  the  algorithm  will  usually  pre¬ 
serve  its  most  important  qualitative  properties. 

3.  The  constraint  method 
The  algorithm  just  described  was  suggested  by  a  reg¬ 
ularization  method  (Poggio  and  Verri,  in  preparation) 
that  follows  directly  from  some  results  of  Tikhonov  and 
the  discrete  nature  of  the  image  data.  We  outline  here 
the  basis  of  the  constraint  method  and  then  discuss  how 
it  is  implemented  by  our  motion  algorithm.  We  post¬ 
pone  a  more  formal  discussion  to  a  later  paper. 


3.1. Tikhonov Lemma

As  pointed  out  by  Poggio  and  Torre  (1984)  many  prob¬ 
lems  of  early  vision  are  ill-posed.  For  example  solutions 
to  problems  such  as  surface  reconstruction,  computation 
of  visual  motion  and  depth  from  stereo  are  not  unique  or 
they  are  not  stable  (for  instance  for  edge  detection  and 
structure  from  motion).  A  lemma  by  Tikhonov  and  Ar¬ 
senin  (1977)  suggests  how  to  regularize  these  problems 
by  exploiting  the  discrete  nature  of  the  image  data  and 
the  boundedness  of  sought  solutions  (for  instance  depth 
of  surfaces  or  velocity  values). 

Let us consider the problem of solving the equation

Ax = y \qquad (3)

for x ∈ X, X a metric space. Let y ∈ Y, Y a metric
space, and let A be an operator mapping D(A) ⊂ X
onto R(A) ⊂ Y. In many applications it is required that
the solution x to (3) i) exists, ii) is unique and iii)
depends continuously on y. A problem whose solution
satisfies i), ii) and iii) is said to be well-posed; otherwise
it is said to be ill-posed. It is clear that the solution does
not exist if y ∉ R(A) and that it is not unique if A is not
injective. The solution depends continuously on y when
the inverse of A, A^{-1}, is continuous. Let us assume that
the solution to (3) is given by

x_0 = \inf_{x} \| Ax - y \| \qquad (4)

Sufficient  conditions  for  the  well-posedness  of  (3) 
(or  (4))  are  provided  by  the  following  lemma  by 
Tikhonov: 

Lemma. Suppose that the operator A maps a compact
set F ⊂ X onto the set U ⊂ Y. If A : F → U is continuous
and one-to-one, then the inverse mapping is
also continuous.

The  uniqueness  is  obviously  guaranteed  by  the  fact 
that  A  is  one-to-one  while  both  the  existence  and  the 
stability  of  the  solution  rely  upon  the  compactness  of 
the set F. Let us assume that X (or F) is a closed and
bounded subset of R^n. Therefore X (or F) is compact.
Hence, if the solution belongs to a closed and bounded
set in R^n and A is one-to-one, the problem to solve is
well-posed. Since every early vision problem is defined
on an n-dimensional space (where n is the number of
pixels), sufficient conditions for the well-posedness of these
problems  can  be  given  setting  a  priori  bounds  on  the 
solutions.  In  the  specific  case  of  our  motion  algorithm 
the  compactness  assumption  is  satisfied  by  restricting 
the  input  and  output  spaces  to  be  discrete  and  finite 
(only  a  small  set  of  discrete  velocities  is  allowed). 
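
As a small illustration of why a finite, bounded candidate set helps, the sketch below minimizes the residual of equation (4) over a handful of allowed values. The operator `A`, the candidate list `allowed` and the function name `solve_on_finite_set` are hypothetical and chosen only for the example; the point is that a minimizer always exists and stays inside the allowed set even when the data are badly perturbed.

```python
def solve_on_finite_set(operator, y, candidates):
    """Return the candidate x that minimizes |operator(x) - y|."""
    return min(candidates, key=lambda x: abs(operator(x) - y))


# Hypothetical operator whose unconstrained inverse amplifies errors: x = 1000 * y.
A = lambda x: 0.001 * x
allowed = [-3, -2, -1, 0, 1, 2, 3]  # small, bounded, discrete set of values

x_clean = solve_on_finite_set(A, 0.002, allowed)           # exact data
x_noisy = solve_on_finite_set(A, 0.002 + 0.0004, allowed)  # slightly perturbed data
x_wild = solve_on_finite_set(A, 0.05, allowed)             # grossly perturbed data
print(x_clean, x_noisy, x_wild)  # 2 2 3
```

The unconstrained inverse would return 2, 2.4 and 50 for the same three inputs; the restricted problem returns only values from the allowed set.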




It  is  important  to  stress  that  in  several  early  vision 
problems  the  solution  is  not  unique.  For  example  the 
same 2-D motion field can be generated by the projection
of different kinds of 3-D motion fields. This corresponds
to the problem of solving equation (3) when A is
not  one-to-one.  In  this  case  the  lemma  is  not  sufficient 
since  it  does  not  guarantee  uniqueness.  Let  us  assume, 
however,  that  the  solution  is  still  bounded:  only  a  fi¬ 
nite  number  of  solutions,  then,  are  possible  due  to  the 
quantization.  Uniqueness  can  be  achieved  by  giving  a 
set  of  rules  that  allow  the  selection  of  at  most  one  value 
(possibly  none)  for  the  solution  at  every  point:  such 
a  strategy  obviously  will  always  succeed  due  to  the  fi¬ 
nite  number  of  possible  values  at  each  location.  It  is 
worth  noting  that  the  solution  could  actually  turn  out 
to  be  defined  only  in  some  locations,  corresponding  to 
the  selection  of  no  value.  The  set  of  rules  corresponds  in 
our  algorithm  to  the  voting  followed  by  non-maximum 
suppression. 

3.2.  Equivalent  physical  constraints 

The  algorithm  can  be  extended  to  a  less  restrictive  con¬ 
straint:  that  the  optical  flow  is  locally  linear  or  even 
quadratic  instead  of  simply  constant.  The  quadratic 
case  corresponds  to  a  very  simple  constraint  on  3-D 
motion  and  surfaces,  under  the  assumption  that  the  op¬ 
tical  flow  is  sufficiently  close  to  the  2-D  motion  field. 
Under  this  assumption  we  can  use  the  following  result 
due to Waxman (1986): the velocity field on the image
plane originated by arbitrary, rigid 3-D motion of
a planar surface patch is quadratic. In this way, the
algorithm exploits the physical assumption that surfaces
are — at  least  relative  to  the  image  resolution — locally 
planar  (the  “allowed”  world  is  thus  a  world  of  polyhe¬ 
dral  solids,  albeit  with  a  very  high  number  of  faces). 
It  is  worth  noting  that  our  initial  experiments  indicate 
that  the  quadratic  and  even  the  linear  assumptions  do 
not  change  significantly  the  results  obtained  with  the 
“constant  constraint”  algorithm. 

3.3.  An  iteration  scheme 

The implementation of the quadratic patch constraint
is computationally expensive, even for coarse discretization
of the velocity values. Though it may be unnecessary
in practice to consider quadratic patches, it is of
interest to develop a scheme that allows for a fast
approximate solution based on the constant field assumption
that  is  then  refined  in  terms  of  higher  order  assumptions 
such  as  linear  and  quadratic  patch.  It  is  natural  to  con¬ 
sider  an  iteration  that  first  finds  the  best  “constant” 


solution,  then  refines  it  with  the  best  “linear”  correc¬ 
tion  and  finally  finds  the  best  “quadratic”  correction. 
In  general,  the  best  quadratic  correction  does  not  pro¬ 
vide  the  best  quadratic  approximation.  Results  however 
about  the  estimation  of  polynomial  operators  (see  for 
instance  Poggio,  1975,  theorem  4.2)  suggest  that  iterat¬ 
ing  the  procedure  should  converge  to  the  best  quadratic 
approximation.  In  this  way  we  can  find  the  best  “con¬ 
stant”  estimation  of  the  optical  flow  and  then  refine  it 
by  successive  iterations  that  cycle  from  the  lowest  to  the 
highest  order  and  to  the  lowest  again. 

4.  The  Connection  Machine™ 
implementation 

The time Δt between images is small, on the order of
one video frame (1/30th second). During this short
time, the appearance of a moving object can change due
to its own motion, camera motion, light source motion,
or all three, among other effects (see Verri and Poggio,
this Proceedings). However, when the local intensity
variation in the surface albedo is sufficiently large, the
errors introduced by these effects are minimized. For
this reason, we use the output of an edge detection step
as the input to later stages of optical flow computation.
Both the Laplacian of a Gaussian and Canny’s edge
detector (Canny, 1986) have been used with good results.
Let E_t be an image containing a description of the
features at time t. Features can, for example, describe the
sign of the x and y components of the image gradient
at edge points.

The comparison of features in E_t and E_{t+Δt} is
performed for each v and is recorded at each x, y in a
matching map M(v), which identifies whether the feature at
E_t(x, y) found a match in E_{t+Δt}(x, y) when displaced
by (v_x, v_y)Δt in the image plane. This process is
spatially parallel, operating at all x, y simultaneously, for
each v in the discrete set that is allowed. The process
iterates over this range, generating the matching maps
M(v).

The integration stage acts only upon the matching
maps M(v). By assumption, we are looking for the
optical flow that is locally constant. We choose the
flow v which, in a neighborhood N around x, y,
maximizes the number of matches. Again this procedure
iterates over all v in the bounded range. We identify a
displacement only at those locations x, y at which the
maximum vote is unique; ties are ambiguous and are
eliminated. The result is a map of the optical flow at
locations in the image at which a moving feature has
undergone unambiguous motion.
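
The selection rule with ties eliminated can be sketched as below. The data layout (a dictionary from velocity to a 2-D list of vote counts) and the name `select_unambiguous` are assumptions made for illustration, as is the choice to leave a location undefined when no velocity received any vote.

```python
def select_unambiguous(vote_maps):
    """Pick, per location, the velocity with the strictly largest vote.

    vote_maps maps a velocity (vx, vy) to a 2-D list of vote counts.
    Locations with a tied maximum, or with no votes at all (an assumption
    of this sketch), are left as None.
    """
    velocities = list(vote_maps)
    first = vote_maps[velocities[0]]
    h, w = len(first), len(first[0])
    flow = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            counts = [vote_maps[v][y][x] for v in velocities]
            top = max(counts)
            if top > 0 and counts.count(top) == 1:
                flow[y][x] = velocities[counts.index(top)]
    return flow


# Usage: one row of three locations and two candidate velocities.
votes = {(0, 0): [[3, 2, 0]], (1, 0): [[1, 2, 0]]}
print(select_unambiguous(votes))  # [[(0, 0), None, None]]
```

Because the vote maps are kept separate from the comparison stage, the same maps can be reused if the selection rule is changed.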




The comparison and integration stages could be
merged  into  one  stage,  in  which  case  we  would  be  di¬ 
rectly  implementing  a  binary  correlation  scheme  on  the 
feature  maps. 

The  constraint  method  can  also  be  used  to  ex¬ 
ploit  more  sophisticated  and  less  restrictive  assumptions 
about  surfaces.  We  could,  for  instance,  assume  that  the 
motion  surface  is  locally  planar,  or  that  the  patch  is 
locally  quadratic.  These  more  general  assumptions  can 
be  implemented  by  changing  appropriately  the  geome¬ 
try  of  the  neighborhoods  over  which  the  voting  takes 
place.  We  have  implemented  the  iterative  scheme  for 
determining  the  best  linear  correction  (linear  in  x,y) 
to  the  constant  solution.  There,  the  separation  of  the 
two  stages  is  crucial  for  a  fast  implementation;  for  cor¬ 
rections  that  are  not  constant,  the  voting  neighborhood 
no  longer  corresponds  to  a  simple  patch  in  the  match¬ 
ing  map  of  just  one  displacement.  Also,  the  comparison 
operation  can  be  done  once  and  used  for  all  later  inte¬ 
gration  stages. 


Figure 1. A disc with random texture is embedded in a
background  pattern  with  the  same  texture.  The  shape  of 
the  object  becomes  immediately  visible  when  foreground  and 
background  pattern  are  moving  with  different  speeds  (Figure 
2). 

4.1.  Examples 

We used both synthetic and real images for testing
the implementation of the algorithm on the Connection
Machine™ system. The algorithm requires as
much  structure  as  possible  in  the  images  to  be  ana¬ 
lyzed.  In  the  synthetic  images  we  produce  sufficient 
structure  by  using  random  textures  for  the  foreground 


Figure  2.  Needle  diagram  of  the  motion  field  computed  by 
the  constraint  method.  A  disc  with  random  texture  is  mov¬ 
ing  with  half  speed  in  the  same  direction  as  the  background. 
The  algorithm  computes  consistent  velocities  for  foreground 
and  background  motion. 


Figure  3.  Needle  diagram  of  the  motion  field  computed  by 
the  constraint  method.  A  disc  with  random  texture  is  ro¬ 
tating  clockwise  at  2  degrees  per  time  step  in  front  of  a 
stationary  background. 

and background patterns¹ (Figure 1). Since the foreground
has the same texture as the background, it remains
invisible to the human observer as long as it is
not moving. As soon as either the foreground or the
background pattern begins to move, the object becomes
immediately visible. The needle diagrams of the optical
flow field in Figure 2 show that the algorithm
successfully computes consistent motion if foreground and
background patterns move with different speeds. In
Figure 3 the disc-like foreground pattern is rotating
clockwise in front of a stationary background. Again all
motion vectors show locally consistent direction and
magnitude of velocity for the foreground pattern. Figure 4
shows that the Connection Machine™ implementation
of the constraint method is able to compute a consistent
description of motion in a natural sequence of images.
Two images are digitized in 30 msec intervals with the
Vision Machine’s eye-head system and analyzed by the
Connection Machine™. Almost all of the significant
intensity changes on the moving robot are labelled with
motion vectors pointing to the left, which is consistent
with the actual motion of the robot. The non-moving
chair and the background are invisible to the motion
detection mechanism. Only a few small patches show false
motion due to apparent motion of specular reflections
caused by changes in illumination between frames. On
a Connection Machine™ having 16K processors, the
optical flow of a 128 x 128 image can be computed in a
few seconds, several hundred times faster than on a
Symbolics 3640 Lisp Machine.

¹ To prevent aliasing in the motion computation output,
the images were appropriately bandpass filtered.


5.  Finding  discontinuities 

We  are  presently  analyzing  several  methods  for  finding 
efficiently  initial  estimates  of  the  discontinuities  of  the 
optical  flow  that  may  be  later  refined  (for  instance  at 
the  integration  stage).  We  give  here  a  very  brief  outline 
of  three  methods: 

5.1.  Statistics  of  the  voting  step 

At  motion  discontinuities  the  assumption  of  a  constant 
(or  linear  or  quadratic)  motion  field  is  obviously  wrong. 
One  would  expect  therefore  that  the  “votes”  at  a  motion 
discontinuity  (say  in  the  case  of  the  “constant”  motion 
algorithm)  would  fail  to  support  clearly  any  single  ve¬ 
locity.  In  fact,  regions  of  close  “ties”,  or  equivalently 
of winners with locally minimum votes, often delineate
motion  discontinuities.  Our  implementation  of  this  pro¬ 
cedure  scales  the  number  of  votes  at  a  location  by  the 
total  number  of  features  in  the  voting  neighborhood. 
Close  ties  receive  values  near  0.5.  Figure  5  shows  the 
result  of  thresholding  this  ratio  at  0.75.  This  idea  can  be 
developed  further  by  considering  more  complete  statis¬ 
tics  of  the  votes  (Spoerri  and  Ullman,  in  preparation). 
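
A minimal sketch of the scaled-vote test follows. It assumes that the winning vote count and the number of features in the voting neighborhood are already available per location; the name `discontinuity_mask` and the convention that locations whose ratio falls below the threshold are the ones marked are assumptions of this example, not details taken from the implementation.

```python
def discontinuity_mask(winner_votes, feature_counts, threshold=0.75):
    """Mark locations where the winner has weak support.

    winner_votes[y][x]   : votes received by the winning velocity
    feature_counts[y][x] : number of features in the voting neighborhood
    A location is marked when votes / features < threshold (assumed
    convention); locations without features are never marked.
    """
    h, w = len(winner_votes), len(winner_votes[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = feature_counts[y][x]
            if n > 0:
                mask[y][x] = winner_votes[y][x] / n < threshold
    return mask


# Usage: full support, a close tie (ratio 0.5), and an empty neighborhood.
print(discontinuity_mask([[8, 5, 0]], [[8, 10, 0]]))  # [[False, True, False]]
```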


Figure  4a.  Frame  1  of  2  from  a  motion  sequence  used  to 
test  parallel  motion  detection  algorithms  with  real  images 
(256  x  256).  The  robot  is  moving  to  the  left.  b.  Output 
of  the  constraint  method  algorithm  for  the  motion  sequence 
shown  above.  Most  of  the  needles  point  to  the  left  con¬ 
sistently  with  the  actual  movement.  The  stationary  back¬ 
ground  (wall,  floor  and  chair)  is  invisible  to  the  motion  de¬ 
tectors.  The  small  patches  of  indicated  motion  on  the  left 
are  most  probably  caused  by  apparent  motion  of  specular 
reflections  due  to  changes  in  illumination  between  frames. 

5.2.  Vetoing  coherent  motion 

Motion  discontinuities  can  be  found  by  using  an  algo¬ 
rithm  that  was  suggested  by  data  on  the  insect  visual 
system  (Reichardt  et  al.,  1983).  The  idea  is  to  inhibit 
or  veto  the  value  of  the  optical  flow  at  each  point  by 
the  average  value  of  the  field  over  a  large  region  cen¬ 
tered  at  that  point  whenever  the  motion  is  of  the  same 




type.  The  scheme  suggested  by  the  insect  work  is  the 
following: at each x, y consider separately the x and
the  y  component  of  the  optical  flow,  take  its  value  and 
divide  it  by  the  average  value,  computed  over  a  large 
region.  Figure  6  shows  the  output  of  this  operation  on 
two  examples.  The  average  may  be  Gaussian  weighted. 
It  is  quite  intriguing  to  notice  that  the  basic  operation 
is  very  similar  to  a  recent  proposal  by  Land  (1986).  It  is 
also  similar  to  performing  a  center-surround  operation 
(such  as  the  Laplacian  of  a  Gaussian,  but  with  much 
larger  surround)  on  the  log  of  the  optical  flow. 
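
The divide-by-local-average operation can be sketched for one flow component as follows. The unweighted square window of radius `r`, the names `local_mean` and `relative_motion`, and the handling of a near-zero average (the ratio is set to 0.0) are assumptions made for this illustration; a Gaussian-weighted average could be substituted for the plain mean.

```python
def local_mean(field, r):
    """Unweighted mean over the (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                field[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def relative_motion(component, r, eps=1e-9):
    """Divide one flow component by its average over a large region."""
    mean = local_mean(component, r)
    h, w = len(component), len(component[0])
    return [
        [component[y][x] / mean[y][x] if abs(mean[y][x]) > eps else 0.0
         for x in range(w)]
        for y in range(h)
    ]


# Usage: a patch moving three times faster than its surround.
vx = [[1.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 1.0]]
ratio = relative_motion(vx, r=1)
print(round(ratio[1][1], 3))
```

Where the motion is coherent over the whole averaging region the ratio is 1; values away from 1 indicate motion relative to the surround.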

5.3.  Edge  detection  on  the  optical  flow 

Edge detection on each component of the optical
flow  is  the  simplest  way  to  obtain  an  initial  estimate  of 
discontinuities.  We  plan  to  use  Canny’s  edge  detector 
(Canny,  1986)  on  each  component  of  the  optical  flow. 


Figure  5.  Edge  labeling  by  relative  motion.  Motion  disconti¬ 
nuities  can  be  found  by  finding  the  locations  that  get  a  min¬ 
imum  number  of  votes  for  consistent  motion.  This  should 
be  the  case  at  object  boundaries  for  relative  motion  between 
foreground  (object)  and  background,  a.  A  disc  moves  with 
same  speed  but  in  opposite  direction  to  the  moving  back¬ 
ground.  b.  Object  moves  in  same  direction  but  half  the 
speed  of  the  background. 


Figure  6.  Edge  labeling  by  relative  motion,  a.  A  disc  with 
random  texture  is  moving  with  opposite  direction  in  front  of 
a  moving  background  with  the  same  texture.  Only  those 
indicators  of  relative  motion  with  values  above  a  certain 
threshold  are  shown  here.  b.  The  same  pattern  is  moving 
in  front  of  the  same  background  in  the  same  direction  but 
with  half  the  velocity.  The  size  of  the  labeled  area  depends 
on  the  size  of  the  larger  mask  (see  text). 


Reading  list 

Canny, J.F. (1986) A Computational Approach to Edge
Detection. IEEE Transactions on Pattern Analysis
and Machine Intelligence, Nov. 1986, Vol. 8, No. 6,
679-698.

Drumheller,  M.  and  Poggio,  T.  (1986)  Parallel  Stereo 
Proceedings  of  IEEE  Conference  on  Robotics  and 
Automation,  San  Francisco. 

Hassenstein, B. and Reichardt, W. (1956)
Systemtheoretische Analyse der Zeit-, Reihenfolgen- und
Vorzeichenauswertung bei der Bewegungsperzeption
des Rüsselkäfers Chlorophanus. Z. Naturforsch.
11b, 513-524.

Hildreth,  E.  C.  (1984a)  The  Measurement  of  Visual  Mo¬ 
tion.  Cambridge:  MIT  Press. 

Horn, B. K. P. and Schunck, B. G. (1981) Determining
optical flow. Artif. Intell. 17, 185-203.

Land,  E.H.  (1986)  An  alternative  technique  for  the  com¬ 
putation  of  the  designator  in  the  retinex  theory  of 
color  vision.  P.N.A.S.  83,  3078-3080. 

Marr,  D.  and  Ullman,  S.  (1981)  Directional  selectivity 
and  its  use  in  early  visual  processing.  Proc.  R.  Soc. 
London  Ser.  B  211,  151-180. 

Poggio,  T.  (1975)  On  optimal  nonlinear  associative  re¬ 
call.  Biol.  Cybern.  19,  201-209. 

Poggio,  T.  and  Torre,  V.  (1984)  Ill-posed  problems  and 
regularization  analysis  in  early  vision.  Massachu¬ 
setts  Institute  of  Technology  Artificial  Intelligence 
Laboratory  Memo  773,  C.B.I.P.  Paper  001. 

Poggio, T. and Reichardt, W. (1973) Considerations
on models of movement detection. Kybernetik 13,
223-227.

Reichardt,  W.,  Poggio,  T.  and  Hausen,  K.  (1983) 
Figure-ground  discrimination  by  relative  move¬ 
ment  in  the  visual  system  of  the  fly.  Part  II:  To¬ 
wards  the  neural  circuitry.  Biol.  Cybern.  46,  1-30. 

Tikhonov, A.N. and Arsenin, V.Y. (1977) Solutions
of ill-posed problems. W.H. Winston, Washington,
D.C.

Torre, V. and Poggio, T. (1978) A synaptic mechanism
possibly underlying directional selectivity to motion.
Proc. R. Soc. Lond. B 202, 409-416.

Ullman,  S.  (1981)  Analysis  of  visual  motion  by  biolog¬ 
ical  and  computer  systems.  IEEE  Computer,  Au¬ 
gust  1981,  pp.  57-69. 

Waxman, A. M. (1986) Image flow theory: A framework
for 3-D inference from time-varying imagery.
In Advances in Computer Vision, ed. C. Brown,
New Jersey: Erlbaum. In press.


Using  Optimal  Algorithms  to  Test  Model  Assumptions  in  Computer  Vision 


Terrance E. Boult*

Columbia  University  Computer  Science  Department 


Abstract 

To develop usable algorithms to solve problems in computer
vision one inevitably has to make a number of
assumptions about the problem, and about the world. At
times,  we  are  able  to  derive  optimal  error  algorithms.  Un¬ 
fortunately,  as  many  of  us  have  seen  first  hand,  “optimal” 
algorithms,  when  run  on  real  data,  can  give  very  poor  re¬ 
sults.  As  a  result,  the  developed  algorithm  is  then  laid 
to  rest,  a  death  caused  by  assuming  a  world  model  which 
does  not  coincide  with  reality.  However,  such  optimal  al¬ 
gorithms  may  still  serve  a  purpose  in  computer  vision.  At 
times,  these  algorithms  may  be  used  to  help  test  the  as¬ 
sumptions  which  we  can  make  about  the  world  and  thus 
develop better models of the world or better models of
animal visual systems.

After  a  general  discussion  of  the  use  of  optimal  algo¬ 
rithms  in  verification  of  model  assumptions,  we  present 
an  example  from  the  modeling  of  human  reconstruction  of 
surfaces  from  sparse  visual  depth  data.  We  then  discuss 
the  interplay  of  contaminated  data  and  modeling.  We  end 
with  a  short  discussion  of  the  interplay  between  optimal 
error  algorithms,  algorithm  complexity,  and  modeling. 


1  Introduction 

A  large  part  of  the  research  in  computer  vision  can  be  viewed 
as  development  of  models  for  the  extraction  of  information  from 
the  visual  data  or  development  of  models  of  various  components 
in  animal  visual  systems.  Because  most  vision  problems  are  ill- 
posed,  the  models  developed  must  contain  various  assumptions 
about  the  world.  Unfortunately,  analytical  analysis  is  not  prac¬ 
tical  for  models  of  all  processes.  Thus  many  researchers  have 
turned  to  computational  analysis  (simulation)  to  provide  them 
with  means  of  comparing  or  validating  models.  In  addition  to  the 
traditional  problems  of  assessing  the  difference  between  experi¬ 
mental  behavior  and  model  predictions  caused  by  experimental 
errors  and  factors  not  considered  in  the  model,  the  researcher 
using  computational  analysis  must  deal  with  errors  which  are 
caused  by  the  inaccuracy  of  the  “computational  embodiment  of 
the  model” . 

We highlight this last difficulty by a simple example. Suppose
that, based on analytical methods, we can predict one of two models,
M_1 and M_2, both of which are based on the minimization of a
distance functional. For the sake of argument let us assume the
distance functional is non-convex for M_1 but is convex for M_2.

* This work was supported in part by DARPA grant # N00039-84-C-1065 and
NSF grant # DCR-82-14322.


Furthermore, let us assume that M_1 is the correct model. As
often happens in practice, we cannot test the two models directly,
but rather we implement algorithms (say φ_1 and φ_2) which solve
example problems (assuming models M_1 and M_2 respectively)
and compare the results of a number of computational
experiments. For simplicity, let us assume the “quality” of a
computational experiment is proportional to the difference between
a known distance and the estimated (computed) minimal
distance proposed by the model. Further, let us assume that both
algorithms use a simple gradient descent algorithm to find the
minimum of their respective distance functionals. After a
number of computational experiments, we would most likely conclude
that model M_2 was better than model M_1.¹ This incorrect
conclusion is not caused solely by the differences in the models but
by the error properties of the algorithms used in the
computational analysis. While the “computational error” was obvious in
this case (using gradient descent for a non-convex minimization
problem), in practice subtle differences in algorithms (and
assumptions about the problem’s definition) can also cause large
differences in the computational error.
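
The effect described in this example can be reproduced with a toy computation. The two one-dimensional functionals below are hypothetical stand-ins chosen for illustration (they are not taken from any vision model), as are the step size, the iteration count and the starting points. The same plain gradient descent is applied to both.

```python
def gradient_descent(grad, x0, step=0.01, iters=5000):
    """Plain fixed-step gradient descent in one dimension."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x


# Convex stand-in: a single minimum at x = 1.
def convex_f(x):
    return (x - 1.0) ** 2

def convex_grad(x):
    return 2.0 * (x - 1.0)


# Non-convex stand-in: a local minimum near x = 0.96 and the global
# minimum near x = -1.04.
def nonconvex_f(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def nonconvex_grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


x_convex = gradient_descent(convex_grad, 2.0)        # reaches the minimum
x_nonconvex = gradient_descent(nonconvex_grad, 2.0)  # stops at the local minimum
x_global = gradient_descent(nonconvex_grad, -2.0)    # same functional, other basin

print(x_convex, convex_f(x_convex))
print(x_nonconvex, nonconvex_f(x_nonconvex))
print(x_global, nonconvex_f(x_global))
```

Started from x = 2, the descent on the non-convex functional reports a minimum value well above the true one, so a comparison based on the computed minima penalizes that functional for a failure of the algorithm rather than of the model.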

The  above  example  shows  that  the  comparison  of  algorithms 
does  not  necessarily  imply  anything  about  the  underlying  models. 
In  the  next  section  we  discuss  how  this  changes  if  the  algorithms 
are  optimal  error  algorithms,  and  how  this  can  be  used  to  verify 
model  assumptions  or  choose  between  different  models.  Then  in 
Section  3  we  present  an  example  of  the  general  approach  applied 
to the problem of visual surface reconstruction. We end this paper
with  a  discussion  of  the  interaction  of  optimal  error  algorithms 
and  complexity. 

2  General  Approach 

There  are  four  phases  in  the  general  approach  of  using  opti¬ 
mal  error  algorithms  to  choose  between  different  mathematical 
models  of  a  process.  These  are: 

1. the precise formulation of different models of the process

2. the theoretical derivation of the optimal error algorithm²
for each model given in phase 1

3. the implementation (and analysis of the numerical
stability) of the optimal error algorithms derived in phase 2

4. the comparison of the different models, based on the
interpretation of the computational results of the algorithms in
phase 3. This is generally accomplished using a finite but
representative sample of problem elements.

¹ This conclusion assumes that the gradient descent algorithm would often
get stuck in local minima far from the global minimum of the non-convex
distance function, which is generally the case when applying it to non-convex
functions.

² By optimal error algorithm we mean the algorithm with minimal error.
We therefore assume that the user has chosen a measure of error, e.g. the
maximum over all possible problem instances of the worst case difference between
the true solution and the approximation. The results of the process will be
more significant if the algorithm is strongly optimal, i.e. it has minimal error
for every set of initial data.

For  the  approach  to  be  applicable,  each  phase  must  be  com¬ 
pleted.  The  importance  of  phase  1  should  be  obvious.  To  see 
the  importance  of  phase  2,  note  that  if  we  consider  two  optimal 
error algorithms (say φ_1 and φ_2) for models M_1 and M_2, then we
know that the computational error for each algorithm is inherent
in the problem. If we find φ_1 to be better than φ_2, then we can
draw  one  of  three  conclusions: 

•  M_1 is a better model than M_2.

•  the error inherent in any computational embodiment of M_2
is so large that M_1 is effectively better since the error is
inherent in the model assumptions. No process, including the
one being modeled, could compute better approximations.

•  M_2 is a better model than M_1, but our theoretical
measure of error and our sampling of data for the comparison
are such that the error of M_1 on that sample is considerably
smaller than its theoretical error. (Therefore the
“representative” samples were not representative of the errors
inherent in the model.)

Thus if samples in phase 4 are truly representative, the
problem is well conditioned, and phase 3 shows that the
implementation of the algorithm is well conditioned and numerically stable,
then the comparisons in phase 4 result in a comparison of the
models, not just the algorithms.³

We note that any one of the phases may, and often will,
be difficult if not impossible. Obviously phase one and phase
four should be in the domain of knowledge of the scientist
doing the modeling. Phase 2, which may be the most difficult
for a scientist outside of computer science, may often be
accomplished by referring to previous work in optimal algorithms, see
[Traub-Wozniakowski-80], [Traub-Wasilkowski-Wozniakowski-83],
and [Micchelli-Rivlin-77]. Finally phase three may be
accomplished by consulting a numerical analyst. While the four phases
might call for considerably more effort than a simple comparison
of different algorithms which solve a problem assuming different
models, the extra effort turns a simple comparison of different
algorithms into an experiment which compares different models of
a process.

3  Example:  Reconstruction  of  Surfaces 
from  Sparse  Depth  Data 

In  this  section  we  discuss  some  of  the  details  of  the  use  of  op¬ 
timal  error  algorithms  to  study  models  for  the  process  of  surface 
reconstruction  in  the  human  visual  system. 

³ A problem is well-conditioned if small changes in the input do not result
in wild variations in the solution. An algorithm is numerically stable if the
computed result is the exact solution to a slightly perturbed problem.


Since  the  pioneering  work  of  Julesz,  (see  [Julesz-71]  and  the 
many  references  therein),  it  has  been  known  that  when  presented 
with  sparse  depth  data  the  human  visual  system  “perceives”  a 
dense  surface  rather  than  a  collection  of  unconnected  points  in 
depth.⁴ The perception is both vivid and stable. Furthermore, it
is  the  perception  of  “smooth”  three  dimensional  surfaces.  There 
has  been  considerable  psychological  research  investigating  the 
phenomenological  aspects  of  the  process.  In  recent  years  there 
have  been  a  number  of  computational  models  of  how  this  process 
might  work.  These  computational  models  are  of  great  interest  to 
the  computer  vision  community. 

Researchers  at  MIT,  see  [Grimson-79],  and  [Grimson-81],  pro¬ 
posed  that  we  formulate  the  problem  (assuming  no  noise  in  the 
depth  data)  as  one  of  finding  a  surface  passing  through  the  data 
that  minimizes  a  given  measure  of  surface  unreasonableness.5  In 
their  work,  they  assumed  the  world  contained  surfaces  whose  sec¬ 
ond  derivatives  are  in  L2  and  that  the  measure  of  reasonableness 
was given by the bending energy of a thin plate. That is, the
surfaces are such that if one takes the second derivatives of the
surface and integrates the squares of these values over the
surface, the integral is bounded. The bending energy of a thin
plate is given by the second Sobolev semi-norm.
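For reference, the second Sobolev semi-norm on a two-dimensional domain is conventionally written as shown below. This is the standard textbook form of the thin-plate bending energy, given here as an assumption about notation rather than as a copy of the authors' typesetting; $f$ denotes the surface and $\Omega$ the region over which it is defined.

```latex
\[
  |f|_2^2 \;=\; \iint_{\Omega} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) \, dx \, dy .
\]
```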


While Grimson presented various arguments for using this measure,
there was no formal attempt to show that this model was
appropriate for human vision, nor that it was the most appropriate
for computer vision.

In  [Boult-86],  we  initiated  the  application  of  optimal  error 
algorithms  as  a  tool  for  modeling,  by  applying  the  general  ap¬ 
proach  of  Section  2  to  this  problem.  We  generalized  the  model 
proposed  by  Grimson  by  allowing  other  classes  of  functions  and 
different  measures  of  unreasonableness  of  a  surface.6  For  each 
model  we  implemented  an  optimal  error  algorithm  and  then,  us¬ 
ing  psychological  experimentation,  compared  the  models. 

3.1  Experimental  Procedure 

There  were  9  different  data-sets  considered,  each  generated 
by  a  different  underlying  surface.  These  were  considered  to  be 
representative  of  the  surfaces  we  would  encounter  in  computer 
vision.  We  describe  each  of  these  underlying  surfaces  in  turn  and 
the  way  in  which  that  data  was  sampled. 

Data-set  #1  A  basis  function  using  randomly  located  data.  The  data 
at  all  information  points  has  value  0  save  one  which  has  value  1. 

Data-set  #2  A  small  “wedding  cake”  produced  by  stacking  up  three 
planes  of  height  0,  y,  and  1  and  sampling  on  a  10  by  10  grid. 

4 Sparse depth data is presented by means of random-dot stereograms or
dichoptic displays.

5 This minimization property was proposed in part because the
minimization of an energy function is an operation that might be
performed by a neural network, see [Grimson-81]. The later work in
[Terzopoulos-84] also makes these assumptions.

6 These correspond to different assumptions about “smooth” surfaces in
the world. Because of the arguments about neural-network type
minimizations, we maintained that aspect of the model. The strongly
optimal error properties of the algorithms follow from theoretical work
in a very general setting, see [Lee-85] or [Boult-86]. It is interesting
to note that the algorithms proposed by the MIT researchers are discrete
approximations to the optimal error algorithm for the class they chose.




Data-set  #3  100  randomly  located  points  taken  from  a  quarter  of  a 
wedding  cake  containing  four  planes  with  heights  0,  j,  and  1. 

Data-set  #4  A  parabolic  sheet  (height  0  to  .9)  defined  by  only  16 
regularly  spaced  points. 

Data-set #5 A half-cylinder of radius lying on a plane, sampled at
100  randomly  located  points  in  the  unit  square. 

Data-set  #6  A  quarter  of  a  wedding  cake  (heights  of  the  four  planes 
are  0,  |,  and  1  )  defined  on  a  10  by  10  regular  grid. 

Data-set  #7  A  noisy  approximation  to  the  central  portion  (with  rect¬ 
angular  boundary)  of  a  hemi-spherical  surface  sampled  at  100 
randomly  placed  points. 

Data-set  #8  A  surface  defined  by  two  planes  meeting  at  an  acute 
angle  then  a  parabolic  sheet  meeting  the  right  edge  of  one  of  the 
planes.  Data  sampled  on  a  regular  10  by  10  grid. 

Data-set #9 A saddle surface (i.e. rectangular hyperbolic sheet). In
this case the data is sampled at 100 randomly located points. The
range for the data was [-.9, .9].

From the initial data we reconstructed strongly optimal error
estimates using a reproducing kernel spline based algorithm (see
[Boult-86]) under different formulation assumptions. The
formulations (i.e. class/norm pairs) and the letters which will be
used in subsequent references to them are:

A: $D^{-2}H^{-0.75}$ with the second Sobolev semi-norm,

B: $D^{-3}H^{-1.75}$ with the third Sobolev semi-norm,

C: $D^{-4}H^{-2.75}$ with the fourth Sobolev semi-norm,

D: $D^{-2}H^{-0.5}$ with the second Sobolev semi-norm,

E: $D^{-2}H^{-0.25}$ with the second Sobolev semi-norm,

F: $D^{-2}L^{2}$ with the second Sobolev semi-norm,

G: $D^{-2}H^{0.5}$ with the second Sobolev semi-norm,

H: $D^{-4}L^{2}$ with the fourth Sobolev semi-norm,

I: $D^{-5}L^{2}$ with the fifth Sobolev semi-norm.

The definitions of these classes and their associated semi-norms
appear elsewhere (see [Boult-Kender-86] or [Duchon-76]). The
reader can see sample reconstructions from classes D and F (the
class used by Grimson) for three data-sets in Figures 5-10.

In our psychology experiments to compare the models (phase
of the general approach) the subjects were presented with 9
packages each containing 4 views of the initial data and 36
sample reconstructions. The subjects rated each sample on “how
well it fit their impression of the surface generating the initial
data”. They also compared each sample with 5 others, stating
whether it seemed a better, equal or worse fit to the initial data.
Note that this results in a tremendous amount of data. In total
each of the six subjects ranked 324 sample reconstructions and
made 1485 pairwise comparisons.

3.2 Experimental Results

In the experiment there were two separate types of responses
gathered: direct rating and comparison. Since both types of
responses result in similar findings we present only the former.
For more detail, including an individual analysis of every
data-set, consult [Boult-86]. Because the subjective ranking of the
classes will depend on the underlying data, we shall break each
analysis up into groups:

• Jump discontinuities (data-sets 1, 2, 3, 6)

• Orientation discontinuities (data-sets 5, 8)

• No discontinuities (data-sets 4, 7, 9)

• Overall (data-sets 1, 2, 3, 4, 5, 6, 7, 8, 9).

Note that the performance in the overall category may reflect a
bias in the selection of the initial data (i.e. how many surfaces
with discontinuities were considered as compared to the number
of smooth surfaces tested). For this reason one should not stress
those results.

The analysis of the rating responses is simply the calculation
of the median, mean, and variance of the rating of “quality”. This
data will be displayed for all class/norm pairs in graphical form.
The graph will have nine columns, one for each class/norm pair.
In each column the mean will be presented as a x, the median
as a o, and the variance as a line of the appropriate length
centered around the mean. In addition, above each column will
appear the numerical value of Quality, the mean quality response
to three significant digits.

Throughout the discussion we shall consider the relationship
between the mean rating responses and the “apparent number
of derivatives” for each class. Note that many of the classes are
defined such that the apparent number of derivatives is a
fractional value, see [Boult-86]. For the nine classes the apparent
numbers of derivatives are: class A, 1.25; class B, 1.25; class C,
1.25; class D, 1.5; class E, 1.75; class F, 2.0; class G, 2.5; class
H, 4.0; and class I, 5.0.
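The summary statistics used in the analysis of the rating responses (median, mean, and variance per class/norm pair, with the mean reported to three significant digits) can be sketched as follows. This is only an illustration: the function name `summarize_ratings`, the made-up ratings, and the choice of population variance are assumptions, since the text does not state the rating scale or which variance estimator was used.

```python
from statistics import mean, median, pvariance


def summarize_ratings(ratings_by_class):
    """Return {class label: (median, mean to 3 significant digits, variance)}."""
    summary = {}
    for label, ratings in ratings_by_class.items():
        mean_3sig = float(f"{mean(ratings):.3g}")   # "Quality" value shown above a column
        summary[label] = (float(median(ratings)), mean_3sig, float(pvariance(ratings)))
    return summary


# Hypothetical ratings for two classes, for illustration only.
example = {"A": [4, 4, 3, 5], "F": [3, 2, 4, 3]}
summary = summarize_ratings(example)
```

Pooling the ratings of several data-sets before calling such a function corresponds to the grouped analyses (jump, orientation, none, overall) listed in the text.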


3.2.1  Analysis  of  Data-sets  with  Jump  Discontinuities 

The data on which this discussion is based was generated by
combining the rating and pairwise comparisons from all subjects over
those data-sets where the underlying surface has a jump
discontinuity, i.e. data-sets #1, #2, #3 and #6. Examination of the
rating responses, see Figure 1, suggests that classes D, A, B and
C are best suited to this type of data, while classes H, I, and
G are the poorest of those considered. If we consider the
relationship between the apparent derivatives and the rating, we see
that the peak in quality corresponds to an apparent derivative of
1.25, and that as we increase the number of apparent derivatives,
the quality gradually deteriorates.

Figure 1: Processed responses for data-sets #1, #2, #3, and #6,
i.e., the data-sets with jump discontinuities. (Axes: rating
response versus class, A through I.)

Let us now consider the quality of the reconstructions if we
fix the apparent differentiability of the class of functions and
vary the norm. This can be done by considering classes
A, B, and C, each of which has 1.25 apparent derivatives, but
which minimize the second, third and fourth Sobolev semi-norm
respectively. The rating responses (and comparison responses)
for these three classes are virtually identical. Thus one might
conclude that for these data-sets, the actual semi-norm
minimized was not a dominant factor.

We can also consider the related question of fixing the norm
and varying the class of functions. This is accomplished by
comparing classes A, D, E, F, and G. These classes all minimize the
second Sobolev semi-norm but their functions are assumed to
be in the semi-Hilbert spaces $D^{-2}H^{-0.75}$, $D^{-2}H^{-0.5}$,
$D^{-2}H^{-0.25}$, $D^{-2}L^{2}$, and $D^{-2}H^{0.5}$, which have
quality ratings of 3.97, 3.97, 3.49, 3.22, and 2.67 respectively.
Thus we see there is an inverse relationship between the apparent
differentiability and the mean quality rating over data-sets with
jump discontinuities.

3.3 Analysis of Data-sets with Orientation Discontinuities

The processed responses which appear in Figure 2 were
generated by combining the raw responses from all subjects for
data-sets #5 and #8. Examination of the rating responses in
Figure 2 suggests that classes F, H, E, and G are well suited to
data whose underlying function has orientation discontinuities
and that classes I, B, and C are poorly suited to such data.

Figure 2: Processed responses for data-sets #5 and #8, i.e., those
data-sets with orientation discontinuities.

Again let us consider the quality of the reconstructions
if we fix the apparent differentiability of the class of functions
and vary the norm. The rating responses for classes
A, B, and C are virtually identical. Again, one might conclude
that for these data-sets, the actual semi-norm minimized was not
a dominant factor.

We can also consider the related question of fixing the norm
and varying the class of functions. We see there is a unimodal
relationship (with peak at 2.5) between the apparent
differentiability and the mean quality rating over data-sets with
orientation discontinuities.

3.4 Analysis of Data-sets with No Discontinuities

The responses presented in Figure 3 were generated by
combining all subjects’ responses over data-sets #4, #7, and #9.
These data-sets were generated by underlying functions which
were very smooth, in fact infinitely differentiable. Examination
of the rating responses suggests that all classes except D and A
are well suited to the problem, with classes H and I being slightly
inferior to classes F, E, G, B and C. The comparison responses
support the idea that classes F, G, B and C are best suited to
the smooth data-sets.

Figure 3: Processed responses for data-sets #4, #7, and #9, i.e.,
those data-sets with very smooth underlying functions.

Considering the relationship between norm and rating
responses we see that classes A, B, and C have quite different
responses. This difference is due to the fact that the
reconstructions for classes B and C are exact for polynomials of
order 2 and 3 respectively. Thus they exactly reconstruct the
underlying data for data-sets #4 and #9 (but not #7 because of
the noise).
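The role of the null-space can be illustrated with the classical thin-plate spline, which minimizes the second Sobolev semi-norm and therefore reproduces any plane exactly, while a parabolic sheet is only approximated. The sketch below is not the reproducing kernel spline algorithm of [Boult-86]; it is the textbook thin-plate interpolant, and the function names, the random sample sites, and the two test surfaces are assumptions made for illustration. Classes that minimize higher-order semi-norms have correspondingly larger polynomial null-spaces.

```python
import numpy as np


def tps_kernel(r2):
    """Thin-plate kernel r^2 log r, written in terms of the squared distance r2."""
    safe = np.where(r2 > 0.0, r2, 1.0)           # avoid log(0); log(1) = 0
    return 0.5 * r2 * np.log(safe)


def fit_thin_plate(points, values):
    """Interpolate scattered (x, y) -> z samples; returns a callable surface."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    K = tps_kernel(np.sum(diff ** 2, axis=2))    # n x n kernel block
    P = np.hstack([np.ones((n, 1)), points])     # null-space basis: 1, x, y
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    sol = np.linalg.solve(A, np.concatenate([values, np.zeros(3)]))
    w, c = sol[:n], sol[n:]

    def surface(query):
        d = query[:, None, :] - points[None, :, :]
        return tps_kernel(np.sum(d ** 2, axis=2)) @ w + c[0] + query @ c[1:]

    return surface


def plane(p):
    return 0.2 + 0.5 * p[:, 0] - 0.3 * p[:, 1]   # lies in the null-space


def parabolic_sheet(p):
    return p[:, 0] ** 2                          # outside the null-space


rng = np.random.default_rng(0)
sites = rng.random((30, 2))                      # sparse depth sample locations
queries = rng.random((50, 2))                    # where the surface is evaluated

plane_error = np.max(np.abs(fit_thin_plate(sites, plane(sites))(queries) - plane(queries)))
parab_error = np.max(np.abs(
    fit_thin_plate(sites, parabolic_sheet(sites))(queries) - parabolic_sheet(queries)))
```

The plane is recovered to rounding error because its second derivatives vanish, so the semi-norm assigns it zero cost; the parabolic sheet is not, which is the kind of advantage the text attributes to classes whose null-spaces contain the underlying surface.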

With respect to the relationship between apparent derivatives
and responses (taking into account the fact that classes H, I, B
and C were exact because of their null-spaces), we see there is a
unimodal tendency (with peak around 2). However, because we
exclude almost half the classes the result is considerably weaker
than those for data-sets with jump or orientation discontinuities.

3.5 Analysis of Performance over All Data-sets

Examination of the overall ratings suggests that classes F,
E, B and C are the overall most appropriate classes of those
tested. We point out that this overall score is very dependent on
the different individual data-sets, as is apparent from the large
variance in rating responses.

3.6  Experimental  Conclusion 

We have demonstrated models which yield reasonable and
sometimes far better solutions to the visual surface
reconstruction problem than those used in [Terzopoulos-84] or
[Grimson-81]. While we have not fully explored the computational
aspects of these trade-offs, we already know that some of the
non-traditional classes are computationally more efficient when
employing the semi-reproducing kernel spline algorithm.


Figure 4: Processed responses for data-sets #1-9, that is, the
combination over all data-sets. (Axes: rating response versus
class, A through I.)

Secondly, the most appropriate model may be affected by the
“smoothness” of the underlying data. If the underlying data
contains more than one surface (i.e. contains jump discontinuities),
then some of the non-traditional classes were considerably
better. If the underlying surfaces were smooth, most of the classes
tested performed reasonably well.

In the previous sections we pointed out a possible relationship
between the number of apparent derivatives in the class and the
quality of the associated reconstructions. The exact nature of
the relationship seemed dependent on the actual data-set, but it
appeared to be a unimodal relationship with the location of the
peak depending on the data-set (accounting for differences due
to null-spaces).

Finally,  for  the  case  when  the  underlying  surface  had  jump 
or  orientation  discontinuities,  we  showed  that  changing  the  norm 
while  keeping  the  class  of  functions  basically  fixed  did  not  greatly 
affect  the  quality  of  reconstructions,  while  fixing  the  norm  and 
varying  the  class  of  functions  did  produce  significant  changes  in 
reported  quality. 


4 How Contaminated Data Affects the Process


A common technique in solving ill-posed problems (such as
the surface reconstruction problem discussed above) is the
use of regularization techniques (see [Poggio-Torre-Koch-85] and
the references therein) or smoothing splines (see [Whaba-84]).
These approaches, however, assume that one knows the correct
model for the process and provide ways of dealing with
contaminated data.
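As a point of reference, the smoothing spline approach is usually posed as a penalized least-squares problem. The form below is the standard one from the spline literature, not a formula taken from this paper; the data sites $x_i$, the noisy depth values $z_i$, the weight $\lambda > 0$, and the order $m$ of the Sobolev semi-norm $|f|_m$ are notation assumed here for illustration.

```latex
\[
  \min_{f} \;\; \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i) - z_i \bigr)^2 \;+\; \lambda \, |f|_m^2 .
\]
```

As $\lambda \to 0$ the minimizer approaches the interpolating spline, and larger $\lambda$ trades fidelity to the data for smoothness.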

When modeling a process, it is often as important to model
the error as it is to model the underlying process. In doing this,
the modeler has two distinct choices: develop a single model
that handles both the underlying process and the error, or
develop separate models. If one chooses the former, the general
approach described above can be applied; however, the
derivation of optimal error algorithms becomes more difficult. If one
separates the models, one can first experimentally determine the
model of the underlying process and then, using this, develop
algorithms that handle contaminated data.7

5  Error,  Models  and  Complexity 


Traditionally,  one  develops  a  model  for  a  problem  and  an 
algorithm  which  solves  it  with  as  small  a  cost  as  possible.  Such 
an  approach  often  does  not  consider  what  the  error  inherent  in 
the  model  is,  and  how  that  is  related  to  the  error  of  the  developed 
algorithm. In Section 2, we discussed the general approach for
refining  the  model  of  a  problem  using  optimal  error  algorithms. 
While  this  may  be  good  for  abstract  modeling,  we  realize  that 
an  often  important  aspect  of  a  model  is  that  it  can  be  used  to 
quickly  produce  some  predictions.  Thus,  one  desires  algorithms 
with  as  small  a  complexity  as  possible. 

While  it  is  often  the  case  that  optimal  error  algorithms  are 
also  optimal  complexity  algorithms  the  relationship  is  not  as¬ 
sured,  see  [Traub-Wasilkowski-Wozniakowski-83].  In  particular 
the  complexity  depends  heavily  on  the  model  of  computation 
(especially  if  parallel  computation  is  to  be  used),  while  the  er¬ 
ror  properties  are  inherent  in  the  problem,  independent  of  any 
computer.  Thus,  while  the  optimal  error  algorithm  generally 
does  not  significantly  change  between  the  (reasonable)  models  of 
computation,  the  complexity  of  that  algorithm  generally  does. 

However,  all  is  not  lost.  Once  we  have  developed  an  appropri¬ 
ate  model  for  a  problem,  nothing  is  to  prevent  us  from  using  that 
knowledge  to  develop  approximate  algorithms  which  are  compu¬ 
tationally  more  efficient  than  the  optimal  error  algorithm. 


References 

[Boult-86]  Terrance  E.  Boult.  Information  Based  Complexity  in 
Non-Linear  Equations  and  Computer  Vision.  PhD  thesis,  De¬ 
partment  of  Computer  Science,  Columbia  University,  1986. 

[Boult-Kender-86] Terrance E. Boult and John R. Kender. Visual
surface reconstruction using sparse depth data. In Proceedings
of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 68-76, June 1986.

[Duchon-76] J. Duchon. Interpolation de fonctions de deux
variables suivant le principe de la flexion des plaques minces.
Revue Française d’Automatique, Informatique et Recherche
Opérationnelle, 5-12, December 1976.

[Grimson-79]  W.  E.  L.  Grimson.  From  Images  to  Surfaces:  A 
Computational  Study  of  the  Human  Visual  System.  PhD  the¬ 
sis,  MIT,  1979. 

[Grimson-81]  W.  E.  L.  Grimson.  From  Images  to  Surfaces:  A 
Computational  Study  of  the  Human  Visual  System.  MIT  Press, 
Cambridge,  MA,  1981. 

[Julesz-71] B. Julesz. Foundations of Cyclopean Perception.
University of Chicago Press, Chicago, IL, 1971.

[Lee-85] D. Lee. Contributions to Information-based Complexity,
Image Understanding, and Logic Circuit Design. PhD thesis,
Department of Computer Science, Columbia University, 1985.

7 In the visual reconstruction problems discussed above, we are
separating the models and currently researching algorithms for the
contaminated case. One obvious way to approach the approximation
problem is to use the smoothing spline form of the reproducing kernel
spline algorithms.


[Micchelli-Rivlin-77]  C.  A.  Micchelli  and  T.  J.  Rivlin.  A  survey 
of  optimal  recovery.  In  C.  A.  Micchelli  and  T.  J.  Rivlin,  edi¬ 
tors,  Optimal  Estimation  in  Approximation  Theory ,  pages  1- 
54,  Plenum  Press,  New  York,  1977. 

[Poggio-Torre-Koch-85]  T.  Poggio,  V.  Torre,  and  C.  Koch. 
Computational  vision  and  regularization  theory.  Nature, 
317(6035):314-319,  1985. 

[Terzopoulos-84]  D.  Terzopoulos.  Multiresolution  Computation 
of  Visible-Surface  Representations.  PhD  thesis,  MIT,  1984. 

[Traub-Wasilkowski-Wozniakowski-83] J. F. Traub, G. W.
Wasilkowski, and H. Wozniakowski. Information, Uncertainty,
Complexity. Addison-Wesley, Reading, MA, 1983.

[Traub-Wozniakowski-80] J. F. Traub and H. Wozniakowski. A
General Theory of Optimal Algorithms. Academic Press, 1980.

[Whaba-84] G. Wahba. Surface fitting with scattered noisy data
on Euclidean d-space and on the sphere. Rocky Mountain
Journal of Mathematics, 14(1):281-299, 1984.


Figure  7:  Reconstruction  using  model  T  for  data-set  #  4. 


Figure  8:  Reconstruction  using  model  V  for  data-set  #  4. 


Figure  6:  Reconstruction  using  model  V  for  data-set  #  2. 


Figure  10:  Reconstruction  using  model  V  for  data-set  #  5 




A  PARALLEL  IMPLEMENTATION  AND  EXPLORATION 
OF  WITKIN’S  SHAPE  FROM  TEXTURE  METHOD 


Lisa  Gottesfeld  Brown  and  Hussein  A.  H.  Ibrahim* 

Department  of  Computer  Science 
Columbia  University 
New  York,  NY  10027 


ABSTRACT 

This paper explores A.P. Witkin’s shape from texture method [1].
We show the results and performance of this method for a parallel
implementation developed on the simulator for the massively
parallel tree-structured supercomputer, the NONVON [2]. We show
that the parallelization of Witkin’s algorithm not only
substantially improves the overall efficiency but does so by exploiting
a more powerful representation which could be used by other
shape-from methods. Lastly, we show how the accuracy of the
method can be improved by using the Gaussian sphere in
representing the isotropy assumption and we compare our results with
the original method and the more efficient methods developed by
Davis, Janos and Dunn [3].


INTRODUCTION 

In  this  paper,  we  discuss  an  implementation  of  A.P.Witkin’s 
algorithm  for  the  recovering  of  surface  orientation  from  texture 
for  the  highly  parallel,  SIMD,  tree-structured  NONVON  super¬ 
computer.  Witkin’s  approach  to  the  surface  from  texture  prob¬ 
lem  is  well-suited  to  a  wide  spectrum  of  textures  and  relies  upon 
the  assumptions  that: 

1.  The  image  is  an  orthographic  projection  representing  a  pla¬ 
nar  surface. 

2. The texture itself is considered to be the distribution of
edge directions and this distribution is assumed to be
uniform prior to the effects of projection. More importantly,
the method will work to the extent that the distribution
of edge directions that compose the texture does not
resemble the effects of projection.

3.  All  surface  orientations  are  equally  likely.  If  we  consider 
each  surface  as  being  represented  by  a  point  on  the  Gaus¬ 
sian  sphere  then  a  given  surface  is  equally  likely  to  be  rep¬ 
resented  by  any  point  on  the  sphere  [4]. 

Given that these assumptions hold completely, this method has
already been proven to be quite robust. The major drawbacks
are that it fails when confronted with textures with regular
features (assumption 2) and it is computationally inefficient. L.S.
Davis, L. Janos and S.M. Dunn have proposed two algorithms
which numerically approximate Witkin’s method and speed the
computation by as much as a factor of four. In this paper, we
consider ways to improve the efficiency through parallelization.
We take care to use representations which give accurate
measurements and can be used to extend the method to a larger domain.

*This work was supported in part by DARPA grant #N00039-84-C-0165 and
DARPA grant #DACA78-86-C-0024.

The idea behind Witkin’s method is as follows. We assume
that the distribution of edge directions is uniform prior to the
projection. If the surface is slanted about an axis parallel to the
X-axis by an angle σ with respect to the image plane then we
expect the distribution of edge directions to become dominated
by edges whose directions are close to horizontal. The larger σ
is, the greater the domination of these edges. This makes
intuitive sense since all the edges in the surface are being more and
more compressed along their vertical component. If the surface is
slanted about a different axis which makes a tilt angle τ with the
X-axis, then the distribution is just shifted. Examples of these
distributions for various slants and tilts are shown in Figure 1.
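The foreshortening argument can be checked numerically with a small sketch. It assumes the geometry described in the text: edge directions on the surface are uniform, the surface is rotated by the slant angle about an axis that makes the tilt angle with the image X-axis, and orthographic projection compresses the component perpendicular to that axis by the cosine of the slant. The function names and the 30-degree window are illustrative choices, not part of the original method.

```python
import numpy as np


def project_edge_directions(surface_angles_deg, slant_deg, tilt_deg=0.0):
    """Image-plane directions, in degrees within [0, 180), of edges on a slanted plane.

    Surface angles are measured from the rotation axis. The component
    across that axis is foreshortened by cos(slant); the tilt shifts the result.
    """
    beta = np.radians(surface_angles_deg)
    along = np.cos(beta)
    across = np.sin(beta) * np.cos(np.radians(slant_deg))
    return np.mod(np.degrees(np.arctan2(across, along)) + tilt_deg, 180.0)


def fraction_near(angles_deg, centre_deg, width_deg=30.0):
    """Fraction of directions within width_deg of centre_deg, modulo 180 degrees."""
    delta = np.abs(np.mod(angles_deg - centre_deg + 90.0, 180.0) - 90.0)
    return float(np.mean(delta <= width_deg))


uniform = np.linspace(0.0, 180.0, 3600, endpoint=False)   # isotropic texture
flat = fraction_near(project_edge_directions(uniform, 0.0), 0.0)
slanted = fraction_near(project_edge_directions(uniform, 60.0), 0.0)
tilted = fraction_near(project_edge_directions(uniform, 60.0, 45.0), 45.0)
```

With no slant about one third of the directions fall within 30 degrees of horizontal; at a slant of 60 degrees more than half do, and adding a tilt moves the concentration to the tilted axis without changing its strength.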

L. Wolff  [5]  presents  an  outline  for  the  implementation  of 
this  method  on  the  NONVON  which  forms  the  basis  for  the  al¬ 
gorithm  used  here.  The  three  phases  of  the  algorithm  are: 

1.  The  determination  of  the  edges  in  the  image  in  each  of  the 
edge  directions  considered. 

2.  The  histogramming  of  the  edge  directions. 

3.  Computing  the  likelihood  for  the  histogram  for  each  possi¬ 
ble  surface  orientation  and  finding  the  maximum  likelihood. 

The first section of this paper discusses the first two phases of the
algorithm while the second section elaborates upon the third
phase. The final section shows how the accuracy of the method
can be improved by using a better representation.
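The second and third phases can be sketched sequentially as follows. This is a minimal illustration, not the NONVON implementation: the density follows from the foreshortening geometry described above (uniform surface directions, compression by the cosine of the slant across the rotation axis), the prior over surface orientations is omitted, and the bin count, grid step, and function names are assumptions. Witkin's own parameterization of tilt may differ from the one used here.

```python
import numpy as np


def edge_direction_pdf(alpha_deg, slant_deg, tilt_deg):
    """Density (per radian) of image edge directions for an isotropic planar texture."""
    a = np.radians(alpha_deg - tilt_deg)
    c = np.cos(np.radians(slant_deg))
    return c / (np.pi * (c ** 2 * np.cos(a) ** 2 + np.sin(a) ** 2))


def estimate_orientation(angles_deg, bins=180, step_deg=5.0):
    """Histogram the edge directions, then grid-search the most likely (slant, tilt)."""
    counts, edges = np.histogram(np.mod(angles_deg, 180.0), bins=bins, range=(0.0, 180.0))
    centres = 0.5 * (edges[:-1] + edges[1:])
    best = (-np.inf, 0.0, 0.0)
    for slant in np.arange(0.0, 90.0, step_deg):
        for tilt in np.arange(0.0, 180.0, step_deg):
            # Log-likelihood of the histogram under this candidate orientation.
            loglik = float(np.sum(counts * np.log(edge_direction_pdf(centres, slant, tilt))))
            if loglik > best[0]:
                best = (loglik, float(slant), float(tilt))
    return best[1], best[2]


# Synthetic edge directions from an isotropic texture on a plane with a known orientation.
true_slant, true_tilt = 60.0, 30.0
beta = np.radians(np.linspace(0.0, 180.0, 3600, endpoint=False))
image_angles = np.degrees(
    np.arctan2(np.sin(beta) * np.cos(np.radians(true_slant)), np.cos(beta))) + true_tilt
est_slant, est_tilt = estimate_orientation(image_angles)
```

Each candidate orientation is scored independently of the others, which is what makes this phase easy to distribute over many processing elements.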

THE  EDGE  DIRECTIONS 

This is computationally the most important phase of the
algorithm. Since this phase is only the input into the algorithm it
is often overlooked; but the choice of representation and method
used here greatly affects the types of information available for
subsequent use. Once the histogram of edge directions is found the
computation of the likelihood estimates is well defined and
computationally less intensive. However, there are several different
ways to determine the edges and their orientations and we would
like to do this both efficiently, utilizing our parallel architecture,
and with an eye to facilitating the detection of edges which do
not satisfy the assumptions.

Figure 1. Plots of pdf(edge direction | σ, τ) for τ = 0 (left) and
τ = 135 (right) at several values of σ. Horizontal axis: edge
direction (degrees).

For the initial detection of the edges, Witkin proposed the
Marr-Hildreth zero-crossing operator, which he used successfully
to demonstrate his algorithm. This method is also applied here
using the parallel functions for the NONVON written by
Burton, Lee, and Wolff [6]. Their implementation is designed around
a multiresolution image pyramid for the purpose of stereopsis.
With this arrangement of the image data, their implementation
could exploit the processing tree architecture of the NONVON.
The NONVON supports four modes of communication: global,
linear, binary tree and, within the lowest level of the tree,
mesh communication. Hence, the multiresolution image pyramid
is stored in every other level of the tree and intra-level
communication is optimal for the lowest level image. For
two pixel neighbors PE1 and PE2 at a higher level, however,
data must first be passed from PE1 down the tree to the mesh
level, cross over to the appropriate child of PE2 and then ascend
the tree before reaching PE2.

Having a multiresolution pyramid for the stereopsis was useful
for performing multi-level matching but it could similarly be used
in improved edge detection. Marr and Hildreth [7] develop this
idea. Each level is used to find the zero-crossings for differing
frequency bands and then a set of parsing rules determines the
relationship between the zero-crossings at each level. The parsing
rules are determined by physical constraints such as the spatial
localization of intensity changes due to each physical phenomenon
in the visual world. Thus, zero-crossings are expected to be found
at each level unless more than one phenomenon has been combined
from a higher frequency band. From these rules, the primitives of
the primal sketch are found. This same technique might be exploited
for use with Witkin’s algorithm. Distributions of edge directions
found at each level could be analysed together with the parsing
rules to determine the frequency band which gives the most reliable
results or greatest likelihoods. In fact, one of the problems
encountered with zero-crossings is that more than what is normally
perceived as the edges in the image are detected. This is
significant since the set of edges which are optimal for Witkin’s
method are not known. In our current implementation, however, just
the mesh level is used for edge detection, so that each convolution
requires only O(N^2) operations where the convolution mask is NxN,
and the computations to find zero-crossings in each direction
require O(S) in the worst case (where the image is SxS) and O(k) in
the average case, for some small constant k.

Given  this  technique  for  edge  detection,  it  still  remains  to 
determine  the  directions  of  the  edges.  To  do  this,  let  us  first 
consider  the  principles  behind  our  edge  detection  scheme.  The 
Marr-Hildreth  operator  was  selected  as  the  optimal  smoothing 
filter  that  satisfied  two  physical  constraints.  The  first  constraint 
is  one  of  spatial  localization  since  it  is  believed  that  only  nearby 
points  contribute  information  to  the  intensity  changes  at  any 
given  point.  The  second  constraint  is  a  frequency  localization. 
We would like a smooth and band-limited filter. These
constraints lead directly to the $D^2 G$ operator, that is, to finding
the zero-crossings of the second directional derivative of the image
convolved with a Gaussian filter. This is approximated by taking
the Laplacian of the Gaussian filtered image, or $\nabla^2 G$.
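A sequential sketch of this edge detection step is given below: a sampled Laplacian-of-Gaussian mask, a direct convolution that performs one shifted multiply-add per mask entry (the N x N cost per pixel mentioned in the text), and a test for zero-crossings along the horizontal direction. The mask size, the value of sigma, the slope threshold, and all function names are assumptions for illustration; this is not the NONVON code of [6].

```python
import numpy as np


def log_kernel(size, sigma):
    """Sampled Laplacian-of-Gaussian mask (size x size, size odd), shifted to sum to zero."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = x ** 2 + y ** 2
    k = (r2 - 2.0 * sigma ** 2) / sigma ** 4 * np.exp(-r2 / (2.0 * sigma ** 2))
    return k - k.mean()


def convolve_same(image, kernel):
    """Same-size filtering with a symmetric mask: one multiply-add pass per mask entry."""
    half = kernel.shape[0] // 2
    padded = np.pad(image, half, mode="edge")     # replicate the border pixels
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def horizontal_zero_crossings(response, rel_threshold=0.25):
    """Mark pixels whose right-hand neighbour has the opposite sign, ignoring weak slopes."""
    left, right = response[:, :-1], response[:, 1:]
    slope = np.abs(left - right)
    return (left * right < 0) & (slope > rel_threshold * slope.max())


# A vertical step edge between columns 15 and 16 of a 32 x 32 test image.
image = np.zeros((32, 32))
image[:, 16:] = 1.0
response = convolve_same(image, log_kernel(size=11, sigma=1.5))
crossing_columns = np.flatnonzero(horizontal_zero_crossings(response)[16])
```

On a mesh-connected array every pixel can carry out the inner multiply-add at the same time, so only the two loops over the mask remain sequential.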

The  first  thing  to  note  is  that  finding  a  zero-crossing  in  a 
given  direction  does  not  necessarily  tell  us  about  the  direction  of 
an edge at this point. Indeed, it is possible that there is a
zero-crossing in every direction at a given point when, for example,
there is only a single edge formed by a uniform intensity change
with constant lines that run parallel to the edge. To choose
among  several  zero-crossings  at  one  point,  we  have  two  simple 
strategies: 

1.  Choose  the  zero-crossing  which  has  the  maximum  slope. 

2. Choose the zero-crossing whose orientation agrees with the
orientations of other neighboring zero-crossings.

The second strategy is supported by the theorem that Marr
and Hildreth call the condition of linear variation, under which,
for smoothed images (as in our case), the two strategies above are
equivalent. The theorem tells us that the intensity variation near
and parallel to the line of zero-crossings should be locally linear.
We are ultimately interested in the direction of an edge at this
point, and we see that according to this theorem the line of
zero-crossings is in the direction of the edge, since we expect
linear intensity changes along and parallel to the edge and a
maximum change across it.

Thus, there are two approaches that can be taken in extracting
the edge directions from the zero-crossings, both of which
can be easily implemented in our parallel scheme. The first
approach takes the edge direction as orthogonal to the orientation
of the zero-crossing with the maximum gradient. The second
approach considers an edge only if there are two zero-crossings of
the same orientation which are adjacent to each other on the line
perpendicular to the zero-crossing orientation. The direction of
this edge is then the direction formed by the line on which the
two zero-crossings lie. This method has the advantage that an
edge is more clearly part of a contour and less likely to have been
contributed by noise. On the other hand, since we are only
interested in “texture,” a contour may be irrelevant. Finally, an
edge in the vertical or horizontal direction is more easily found
due to the nature of the grid, and this will skew our histogram
values.

For both approaches we are limited by the four directions in
which we usually look for zero-crossings, given an image grid
where each pixel has neighbors along four lines: vertical, the two
diagonals and horizontal. For histogramming edge directions,
four directions can suffice depending on the accuracy desired in
the results. We have found that four directions with accurate
information are all that is needed to determine which of 36
uniformly spaced (on the Gaussian sphere) surface orientations
is most likely for an image that is only 32x32. But for more
precise results, we need to consider ways of refining our methods.
For the first approach, we can consider the gradients of
perpendicular zero-crossings as orthogonal components of a vector
which determines the orientation of the maximum slope and thus is
orthogonal to the edge direction. There are actually two pairs of
orthogonal components that we can use as redundant information to
determine the direction of the edge with the accuracy affordable by
the data. For each pair, if only one zero-crossing is found, the
gradient for the other is considered to be zero. If both pairs have
at least one zero-crossing then the two resulting vectors can be
averaged. If the two resulting vectors are more than 90 degrees
apart then the information at that pixel is thrown out since it is
contradictory.
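
The refinement of the first approach can be sketched per pixel as follows. This is only an illustration under stated assumptions: the dictionary interface, the function name `edge_direction` and the use of unscaled slopes for the diagonal directions are ours, not details of the implementation described here.

```python
import math

def edge_direction(slopes):
    """Edge direction in degrees, in [0, 180), or None if unusable.

    `slopes` maps a scan direction (0, 45, 90 or 135 degrees) to the signed
    slope of the zero-crossing found along it; an absent key or None means
    that no zero-crossing was found in that direction.
    """
    def pair_vector(first, second):
        g1, g2 = slopes.get(first), slopes.get(second)
        if g1 is None and g2 is None:
            return None
        g1, g2 = g1 or 0.0, g2 or 0.0   # a missing component counts as zero
        a1, a2 = math.radians(first), math.radians(second)
        # Sum the two orthogonal components in image coordinates.
        return (g1 * math.cos(a1) + g2 * math.cos(a2),
                g1 * math.sin(a1) + g2 * math.sin(a2))

    vectors = [v for v in (pair_vector(0, 90), pair_vector(45, 135))
               if v is not None and math.hypot(v[0], v[1]) > 0.0]
    if not vectors:
        return None
    if len(vectors) == 2:
        (x1, y1), (x2, y2) = vectors
        if x1 * x2 + y1 * y2 < 0.0:      # more than 90 degrees apart
            return None
    gx = sum(v[0] for v in vectors) / len(vectors)
    gy = sum(v[1] for v in vectors) / len(vectors)
    # The edge runs orthogonally to the direction of maximum slope.
    return (math.degrees(math.atan2(gy, gx)) + 90.0) % 180.0
```

A negative dot product is used as the test for "more than 90 degrees apart", which avoids computing the two angles explicitly.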

For the second method, we can consider triplets or quadruplets
of zero-crossings in order to increase the number of different
directions that we can find for the edges, but this requires even
longer edges. Furthermore, the distribution of detectable edge
directions continues to be nonuniform. For greater discernment
in the edge directions the first method is superior for our
purposes since we can uniformly histogram the edge directions which
are found in a single non-expanding set of operations. However,
if the vision system incorporates a contour following algorithm,
then this method might serve us well. Small contours could be
used for the edge direction distribution and larger contours,
indicative of non-textural phenomena, could be passed on to other
modules.

Once  the  edge  directions  have  been  found,  we  can  take 
advantage  of  the  NONVON  tree  communication  capability  to 
compute  the  histogram.  Since  the  edge  directions  now  reside  at 
the  mesh  level,  we  simply  accumulate  the  number  of  edges  in 
each direction as we ascend the tree. The algorithm will require
only O(log n) operations, where n is the number of pixels in the
image, as opposed to the O(n) needed by a sequential machine.
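
The accumulation can be simulated sequentially as below. The sketch assumes a binary tree over the leaves and integer direction bins; the function name and data layout are hypothetical. On a tree machine all additions of one level happen at once, so the number of levels, not the number of leaves, gives the parallel step count.

```python
def tree_histogram(leaf_directions, num_bins=4):
    """Sum per-leaf counts pairwise, level by level, as a binary tree would.

    Each leaf holds a direction bin (0 .. num_bins - 1) or None for "no
    edge". Returns the histogram and the number of tree levels ascended.
    """
    level = []
    for d in leaf_directions:
        counts = [0] * num_bins
        if d is not None:
            counts[d] += 1
        level.append(counts)
    if not level:
        return [0] * num_bins, 0
    steps = 0
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            left = level[i]
            right = level[i + 1] if i + 1 < len(level) else [0] * num_bins
            parents.append([a + b for a, b in zip(left, right)])
        level = parents
        steps += 1   # one parallel step per level of the tree
    return level[0], steps
```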

There are of course several alternatives to this scheme for
determining the edge directions. Two methods that we have
considered are: (1) using the gradient determined by the orthogonal
Sobel operators and (2) using the wedge shaped regions centered
at the origin of the power spectrum of the image. The former
can be implemented on the NONVON in a similar fashion to the
Laplacian of the Gaussian, since it is a convolution, and it has the
advantage of measuring edge orientation more accurately. The
latter also has a well-studied parallelization and, although it is
not as efficient as the convolution with a small mask, the
histogram could be found directly. For all of these methods, there
is an interesting threshold parameter which can be used like the
bandwidth to determine the optimum maximum likelihood.

ORIENTATION  LIKELIHOODS 

This phase of the algorithm, although computationally somewhat
complex, is not nearly as computationally intensive as the previous
one, and therefore we initially thought that it did not justify
implementation on the NONVON but could more easily be accomplished
by its host computer. Basically, it consists of computing the
likelihood of each possible orientation given the distribution of
edge directions. While the number of possible orientations is
infinite, they can be well represented by a small subset on the
order of a hundred. However, after considering all the benefits, it
becomes clear that it is worthwhile to take advantage of the
parallel NONVON architecture.

The  likelihoods  that  we  need  to  compute  are: 

$$L(\theta, \tau \mid A^*) = \frac{\sin\theta}{\pi} \prod_{i=1}^{n} \frac{\pi^{-1}\cos\theta}{\cos^2(\alpha_i^* - \tau) + \sin^2(\alpha_i^* - \tau)\cos^2\theta}$$

where θ, τ are the slant and tilt and A* = [α*_1, ..., α*_n] are the
histogram values. This equation is an application of Bayes’
Theorem, where the first term is the pdf(θ, τ) and the rest is the
pdf(A* | θ, τ). Notice that the likelihood of a particular surface
orientation depends on the probability that that surface occurs.
We assume that all surfaces are equally likely, but this does not
mean that all pairs of slant and tilt values are equally likely.
The probability that a surface is not slanted at all is very small
compared to the probability that it is slanted 90 degrees, because
of the range of tilts that can occur in the latter case. This plays
an important role in determining the surface orientation of
maximum likelihood (as we shall see later), and whenever possible it
should be modified to reflect more pertinent knowledge of the
environment or information obtained by other shape-from methods.
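
A minimal sketch of this computation follows. It assumes the likelihood has the form of a prior sin(θ)/π multiplied, for each edge direction α, by (cos θ/π) / (cos²(α - τ) + sin²(α - τ) cos²θ), works with logarithms to avoid underflow, and scans a hypothetical 10 x 10 grid of slants and tilts sequentially. The function names and the grid are ours; in the scheme described here each grid entry would instead be assigned to its own PE.

```python
import math

def log_likelihood(slant, tilt, edge_angles):
    """Log of the assumed likelihood of (slant, tilt); angles in radians."""
    if slant <= 0.0 or slant >= math.pi / 2.0:
        return float("-inf")
    cos_s = math.cos(slant)
    total = math.log(math.sin(slant) / math.pi)          # prior pdf term
    for a in edge_angles:
        d = a - tilt
        denom = math.cos(d) ** 2 + (math.sin(d) ** 2) * cos_s * cos_s
        total += math.log(cos_s / math.pi) - math.log(denom)
    return total

def most_likely_orientation(edge_angles, n_slants=10, n_tilts=10):
    """Return (log-likelihood, slant, tilt) of the best grid orientation."""
    best = None
    for i in range(n_slants):
        slant = (i + 0.5) * (math.pi / 2.0) / n_slants
        for j in range(n_tilts):
            tilt = (j + 0.5) * math.pi / n_tilts
            ll = log_likelihood(slant, tilt, edge_angles)
            if best is None or ll > best[0]:
                best = (ll, slant, tilt)
    return best
```

Keeping the whole table of log-likelihoods, rather than only the maximum, is what allows the likelihoods to be passed on to other modules.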

Davis, Janos and Dunn have shown two ways to find the
maximum likelihood more efficiently: (1) using a good geometric
heuristic to estimate the tilt and then using a one-dimensional
Newton method to obtain the slant, and (2) expressly computing
the orientation by accurately approximating the equations. Both
methods do indeed improve the efficiency (the second one, very
effectively), although with some loss of accuracy. On the other
hand, much relevant information is lost, especially if we would like
to use this method in conjunction with other shape-from
techniques. Specifically, we no longer know the likelihoods at each
orientation. Witkin showed in his original paper the iso-density
contour maps of the orientation density functions obtained for
images of geographic contours. These figures show graphically




Figure 2. The center column contains the images whose orientations
are, top: τ = 90, θ = 45; center: τ = 90, θ = 20; bottom: τ = 45, θ = 45.
The left column contains the likelihood at each surface orientation,
where the distance from the origin is θ and the angle is τ. The right
column contains the Fourier spectra.




much more information than can be gathered from a single
maximally likely orientation. Not only are other likely orientations
exposed, but the sharpness of the peaks, unlikely orientations,
and the overall shape of the confidence measures are seen. For
this reason, and because of the non-trivial speed-up, we decided
to parallelize this stage of the algorithm, one PE for each
likelihood, so that a fine grained output of the orientation
likelihoods could be obtained. Once these likelihoods are found, the
maximum can be found by traversing up the tree. But in addition,
we have created an ideal structure for outputting all the
likelihoods to other shape-from methods and for feedback between
the likelihoods and the edge direction distributions found at
varying bandwidths and thresholds. This latter result can be used
to identify the edges which optimally satisfy the assumptions,
allowing Witkin’s method to be used on more natural textures
successfully. A sample of our results is given in Figure 2.

IMPROVING  THE  ACCURACY 

The assumption that all surfaces are equally likely is called the
isotropy assumption, and from it the joint pdf of the slant and tilt
is derived. Using the pdf that Witkin derived, given above, gives
poor results because of the singularly small probability associated
with a surface lying close to the image plane.¹ This causes
an error in the prediction of the surface orientation whenever the
slant is small. This error arises because of the choice of the
coordinate space used to represent orientation. Furthermore, if we
take equally spaced intervals of θ and τ, the probability associated
with each grid square is not uniform and thus the discrimination
of surface orientation is also not uniform.

All of this can be easily remedied by a more appropriate
choice of the coordinate space representation of orientation. The
singularity that occurs when the slant is small is a result of the
nonuniform probability associated with each orientation region,
the regions being smallest when the slant approaches zero. If
instead of selecting equally spaced intervals of θ and τ, we choose
equal size areas on the Gaussian Sphere, the joint pdf is simply
a uniform distribution and thus a constant. The advantages are
twofold: large errors at small slants are avoided and the
resolution of the results becomes more consistent.
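
One simple way to obtain cells of equal area on the visible hemisphere is to take equal steps in cos θ rather than in θ, since the area element on the unit sphere is sin θ dθ dτ. The sketch below uses this construction as an assumption; it is not necessarily the tessellation used in our experiments, and the function names are hypothetical.

```python
import math

def equal_area_orientations(n_slants=10, n_tilts=10):
    """Centers (slant, tilt), in radians, of equal-area hemisphere cells."""
    cells = []
    for i in range(n_slants):
        # Equal steps in cos(slant) give bands of equal area on the sphere.
        slant = math.acos(1.0 - (i + 0.5) / n_slants)
        for j in range(n_tilts):
            tilt = (j + 0.5) * 2.0 * math.pi / n_tilts
            cells.append((slant, tilt))
    return cells

def cell_area(i, n_slants=10, n_tilts=10):
    """Area on the unit sphere of any one cell in slant band i."""
    lo = math.acos(1.0 - i / n_slants)
    hi = math.acos(1.0 - (i + 1) / n_slants)
    return (math.cos(lo) - math.cos(hi)) * (2.0 * math.pi / n_tilts)
```

With this choice every cell carries the same prior probability, so the prior term of the likelihood is the same constant for all candidate orientations.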

Figure 3 shows the decrease in the error as a function of
slant obtained by transforming the orientation space into the
Gaussian Sphere. Ten random samples of 2000 uniformly distributed
tangent directions were projected at a constant tilt for each slant.
The projected tangent directions are then histogrammed and given
as input to Witkin’s original method, which finds the maximum
likelihood of 100 surfaces (10 slants and 10 tilts, equally spaced),
and to the modified method, which finds the maximum likelihood
of 100 surfaces approximately uniformly spaced on the
Gaussian Sphere. The error in the predicted orientation (θ*, τ*)
from the known orientation (θ, τ) is measured in two ways. The
bottom plot shows the rms error, similar to that used by Davis,
Janos and Dunn, defined by

$$\epsilon(\mathrm{rms}) = \sqrt{(\theta - \theta^*)^2 + (\tau - \tau^*)^2}$$

where the difference between two tilt angles is the minimum of
the possible differences of the two angles modulo π. Notice the

¹ Lee and Rosenfeld [8] have corrected this to account for the discrete
nature of the data and concluded that the pdf should be (1/2π) sin 2θ, but
this does not alter the problem mentioned here.


inordinately large errors at small slants for both methods. This is
in part because at small slants the effects of the direction of slant,
namely the tilt, are negligible. (See Figure 1.) On the other hand,
a large tilt error at small slants does not indicate that a large
rotation is needed to orient the actual surface to the predicted
one, and yet the size of this rotation is more representative of the
error that we perceive. Therefore, an improved measure of the
error would be the angular distance separating the actual surface
O from the predicted surface O’, where both are represented by
three dimensional Cartesian vectors on the Gaussian Sphere:

$$\epsilon(\mathrm{sphere}) = \arccos(O \cdot O').$$

The top plot in Figure 3 shows the same errors measured in
this way. There is still a large error associated with small slants,
although this better represents the error we would perceive. It
is not surprising that these errors are still large, since changing
the orientation of a surface at small slants changes the region that
is perceived only slightly, while at large slants the size and
location of the physical region seen changes drastically.
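
The two error measures can be compared with a short sketch. The function names are ours, angles are in radians, and the surface normal is assumed to be (sin θ cos τ, sin θ sin τ, cos θ).

```python
import math

def tilt_difference(t1, t2):
    """Smallest difference between two tilt angles, taken modulo pi."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)

def rms_error(slant, tilt, slant_est, tilt_est):
    """Root of the summed squared slant and tilt differences."""
    return math.hypot(slant - slant_est, tilt_difference(tilt, tilt_est))

def unit_normal(slant, tilt):
    """Point on the Gaussian Sphere for a given slant and tilt."""
    return (math.sin(slant) * math.cos(tilt),
            math.sin(slant) * math.sin(tilt),
            math.cos(slant))

def sphere_error(slant, tilt, slant_est, tilt_est):
    """Angle between the actual and the predicted surface normals."""
    a = unit_normal(slant, tilt)
    b = unit_normal(slant_est, tilt_est)
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))   # clamp rounding overshoot
```

For a surface slanted by only 2 degrees, a tilt estimate that is off by 90 degrees gives an rms error of 90 degrees, while the angle between the two normals stays below 3 degrees.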


The implementation of this method on the simulator for the
NONVON computer has exposed both the advantages and limitations
of Witkin’s statistical approach to texture. One obvious
limitation is that not all surface orientations can be distinguished.
For every surface orientation that the method finds, there is
another orientation, slanting in the opposite direction, which is
equally likely. This is obviously counter to human perception. It
suggests that this method is somehow incomplete or the wrong
approach. However, there are several reasons why this method is
appealing. Most other methods which derive shape from texture
rely upon gradient measures. These methods rely upon determining
texel size, shape or spacing and upon assumptions concerning
their uniformity. These methods are complicated by determining
or locating texels and limited to very specific domains. Notice
how these methods contrast with Witkin’s approach, which will
often fail when their assumptions hold but work when they fail.
For example, uniformly shaped texels will cause Witkin’s method
to fail if the texel shape is biased towards certain directions, but
when the texels are all shaped differently Witkin’s is likely to be
applicable. Thus, it would be useful if Witkin’s method could
be applied so that, instead of solely determining the orientation
which is most likely to be represented by the given edge
distribution, the likelihood of each orientation was used as
information to be passed on to other shape from texture methods.
These other methods, which measure shape from texture using
gradients, can then determine precisely what Witkin’s method
doesn’t: the local directionality which uniquely determines the
surface orientation.

References 

[1] A. P. Witkin, “Recovering Surface Shape and Orientation
from Texture,” Artificial Intelligence 17 (1981), 17-45.

[2] D. E. Shaw, “Organization and Operation of a Massively
Parallel Machine,” to appear in Advanced Computer Technology,
Guy Rabbat, editor, Van Nostrand Reinhold, 1986.

[3] L. Davis, L. Janos, and S. Dunn, “Efficient Recovery of
Shape from Texture,” IEEE Trans. on Pattern Analysis and
Machine Intelligence PAMI-5, No. 5, Sept. 1983, 486-492.

[4] J. R. Kender, “The Gaussian Sphere: A Unifying Representation
of Surface Orientation,” Department of Computer
Science, Carnegie Mellon University, Pittsburgh, PA 15213.

[5] L. Wolff, “Parallel Algorithms for the Determination of
Shape from Texture,” Research Project report under Hussein
Ibrahim, Computer Science Department, Columbia University,
New York, NY 10027.

[6] T. Burton, A. Lee, and L. Wolff, “Parallel Stereo Algorithms:
A Multiresolution Approach,” Research Project report
under Hussein Ibrahim, Computer Science Department,
Columbia University, New York, NY 10027.

[7] D. Marr and E. Hildreth, “Theory of Edge Detection,” Proc.
Royal Soc. London B207, 1980, 187-217.

[8] C. H. Lee and A. Rosenfeld, “Improved Methods of Estimating
Shape from Shading Using the Light Source Coordinate
System,” University of Maryland Computer Science Technical
Report TR-1277, April 1983.


Figure 3. Comparison of Witkin’s original method and the modified
version (dotted line). Horizontal axis: slant, from 0 to 90.




LOCALIZED  INTERSECTIONS  COMPUTATION  FOR  SOLID  MODELLING 
WITH  STRAIGHT  HOMOGENEOUS  GENERALIZED  CYLINDERS 


Jean  Ponce  and  David  Chelberg 


Stanford  Artificial  Intelligence  Laboratory 
Stanford University, Stanford, CA 94305


Abstract: This paper reports progress in the development
of a solid modelling system combining straight homogeneous
generalized cylinders through set operations.
Two basic components of this system are the modules
which compute the set operations between primitives and
display the resulting solids using ray tracing. These two
modules are also very computationally intensive, as they
involve a large number of surface-surface and ray-surface
intersection computations. We introduce a novel hierarchical
representation for straight homogeneous cylinders
called the Box Tree. The Box Tree is analogous to a Quadtree
in parameter space. It is an exact boundary representation
which describes the surface of the associated generalized
cylinder by a hierarchy of enclosing boxes. We use
the Box Tree to efficiently compute the set operation and
ray tracing algorithms by localizing the search for intersections
to the regions where they may occur. We discuss
complexity issues and illustrate the performance of our
modelling system on a variety of examples.

Introduction 

A generalized cylinder is the solid obtained by sweeping a
surface, its cross section, along a curve, its axis, or spine.
The axis is not necessarily straight, or even planar; the
cross section is not necessarily circular, or even constant;
its deformation along the axis is governed by a sweeping
rule. Generalized cylinders were intended to represent
primitive solids, while complete shapes would be represented
as part-whole graphs of joined primitives [2]. Generalized
cylinders have been extensively used to represent
3D objects in computer vision [13],[15],[23]. The most
successful vision system to date using generalized cylinders as
its primary representation for three dimensional objects is
probably Acronym [3].

Even in Acronym, however, only very restricted subclasses
of generalized cylinders are implemented (circular or simple
polygonal cross section, straight or circular spine, linear
or bilinear sweeping rule). Very restricted joining operations
are considered. In fact, subparts are simply affixed
to each other. On the other hand, solid modelling
systems have for years used complex joining operations,
such as set operations [19],[25], or blending of surfaces
[7]. The primitive solids are in general simple (polyhedra
bounded by a few planar or quadric faces), but set
operations now extend to more complex primitives, such
as polyhedra with many planar faces [12],[18], or solids
bounded by parametric surface patches [4].

Support for this work was provided in part by the Air
Force Office of Scientific Research under contract F33615-85-
C-5106 and by the Advanced Research Projects Agency of the
Department of Defense under Knowledge Based Vision contract
AIADS S1093-S-1.

Figure 1. Straight homogeneous generalized cylinders: 1-a, a solid of
revolution; 1-b, an SHGC with a star shaped cross section and a sweeping
rule which is a polynomial of degree 4.

This paper reports progress in the development of a geometric
modelling system which manipulates a wide class
of generalized cylinders with the power and flexibility of
a CAD system. Our primitives are straight homogeneous
generalized cylinders (SHGC’s, see [23]). They are generalized
cylinders obtained by scaling a reference cross
section along a straight axis (Figure 1). In the modelling
system presented here, the scaling function is an arbitrary
continuous (C⁰) and piecewise continuously differentiable
(C¹) real function. The cross section is an arbitrary star
shaped C⁰ and piecewise C¹ curve (see Section 2, Figure 4).

The primitives are joined through set operations (union,
intersection, difference). Ray tracing [20],[26] is used for
hidden line/surface display of the resulting composite objects.
Set operations and ray tracing algorithms are traditionally
very computationally expensive, as they involve a
large number of complex surface-surface and ray-surface
intersection computations. This is more generally true
of any intersection algorithm which manipulates complex
surfaces. A recent approach [4],[10],[12],[18] for speeding
up these algorithms has been to use a hierarchical representation
of surfaces to efficiently localize the search for
geometric intersections to the regions where they may occur.

Figure 2. The Box Tree representation: to each quadrant of a Quadtree in
parameter space we associate a box which encloses the associated sector.
At the leaf level, the nodes also contain a representation of the sectors
themselves.

In this paper, we follow an analogous approach. We introduce
a novel hierarchical representation for SHGC’s called
the Box Tree (Figure 2). It corresponds to a Quadtree [22]
in parameter space and represents the associated SHGC
by a hierarchy of boxes enclosing surface patches called
sectors. The Box Tree is an exact representation (i.e. it
does not involve any approximation of the object). We use
it to localize efficiently the set operation and ray tracing
algorithms. We discuss the localized intersections scheme
in more detail in Section 1. We define in Section 2 an
SHGC’s sector and the associated enclosing box, and build
the Box Tree. In Sections 3 and 4 we give the localized set
operation and ray tracing algorithms. Complexity issues
are discussed. Both algorithms have been implemented and
are illustrated in Figures 3, 6, 8, 9 and 12.


Figure 3. The display of the union of a cylinder and a polynomial SHGC
using ray tracing at the limbs.


1.  Localized  intersection  algorithms 

Set  operation  algorithms  for  boundary  representations  of 
solids  can  be  decomposed  into  the  following  steps. 

[1]  Compute  the  list  of  all  the  intersecting  faces  of  both 
solids,  and  associate  to  each  of  them  the  list  of  faces 
it  intersects. 

[2] Use this list to compute the intersection curves and
cut the intersecting faces into non intersecting subfaces
along these curves.

[3]  The  intersection  curves  partition  the  surfaces  of  both 
solids  into  a  set  of  connected  components  (CC’s)  of 
non  intersecting  faces.  Classify  the  CC’s  of  each  solid 
as  being  inside  or  outside  of  the  other  solid. 

The resulting solid is formed by the appropriate combination
of inside and outside faces of both solids (e.g. the
union is made of all the outside faces). Let S_1, S_2, and
S_3 be the complexities of the three steps. The overall
complexity is S = max(S_1, S_2, S_3).

Similarly,  ray  tracing  can  be  decomposed  for  each  pixel 
into  the  following  steps: 

[1]  Compute  the  list  of  all  the  faces  intersecting  the  ray 
associated  to  the  pixel. 

[2] For each face in this list, compute the intersection
point. Find the first point along the ray and use it to
compute the intensity of the pixel.

The overall complexity of ray tracing per pixel is R =
max(R_1, R_2), where R_1 and R_2 are the complexities of the
two steps.

Suppose now that the solids’ surfaces are represented by
polyhedra whose n (possibly curved) faces all have comparable
sizes. To find the intersecting faces of two solids, we
must compare all faces of both solids, so S_1 is O(n^2). S_2 is
proportional to the length of the list of intersecting faces.
For homogeneous decompositions, this length is O(p√n),
where p is the perimeter of the intersecting curves (the
area of a surface grows as the square of the length of the
curves drawn on it). Finally, for each connected component
CC_i, step 3 involves classifying one face as inside or
outside (an O(n) step, see for example [11]), and then spreading
this classification to all the faces of CC_i (an O(n_i) step,
where n_i is the number of faces of CC_i, if we suppose that
accessing the neighbors of a face can be done in constant
time). It follows that S_3 is O(c.n), where c is the number
of CC’s. p and c are essentially constants, so the overall
complexity of the set operation algorithm is O(n^2). Similarly,
for ray tracing all faces must be tested against a
ray, so R_1 is O(n). R_2 is proportional to the essentially
constant number r of faces actually intersecting the ray,
so R is O(n).

In the context of set operations, several authors (see for
example [4],[12],[18]) have suggested using hierarchical
surface representations to localize the search for intersecting
faces to the regions where they may occur, i.e. near the
intersection curves. Using this method, they have been
able to reduce the overall complexity of the set operation
algorithm to O(n). Mantyla and Tamminen [12] organize
the space around solids in a structure called BOX-EXCELL,
and use it to efficiently access all faces intersecting a given
face. Carlson [4] and Ponce and Faugeras [18] use hierarchies
of enclosing boxes (Quadtree and Prism Tree data
structures) to prune the search for the intersecting faces.
Similarly, it is possible to use the Prism Tree representation
for polyhedra, or a similar representation for fractal
surfaces [6],[10], to reduce the ray tracing complexity R
to O(log n). In the next Section, we introduce the Box
Tree representation for SHGC’s. We will show in Sections
3 and 4 how we can use it to localize the set operation
and ray tracing algorithms and reduce their respective
complexities to O(√n) and O(log n), where n is the number
of faces in the equivalent homogeneous decomposition.
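
The pruning idea common to these hierarchical schemes can be sketched as follows. This is a generic illustration, not the Box Tree algorithm of Section 3: the `Node` class, the axis-aligned boxes and the descent order are assumptions made to keep the example short.

```python
class Node:
    """Node of a hypothetical hierarchy of enclosing boxes."""
    def __init__(self, box, children=(), face=None):
        self.box = box                  # ((xmin, ymin, zmin), (xmax, ymax, zmax))
        self.children = list(children)  # empty for a leaf
        self.face = face                # payload stored at a leaf

def boxes_overlap(a, b):
    """True if two axis-aligned boxes share at least one point."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))

def candidate_pairs(n1, n2, out):
    """Collect the pairs of leaf faces whose enclosing boxes overlap."""
    if not boxes_overlap(n1.box, n2.box):
        return out                      # prune both subtrees at once
    if not n1.children and not n2.children:
        out.append((n1.face, n2.face))
    elif n1.children:
        for child in n1.children:
            candidate_pairs(child, n2, out)
    else:
        for child in n2.children:
            candidate_pairs(n1, child, out)
    return out
```

Only the pairs returned by `candidate_pairs` need an exact face-face intersection test; subtrees whose boxes are disjoint are never visited.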

2.  Sectors,  Boxes,  and  the  Box  Tree 

We  now  define  sectors  and  boxes  (Figure  5)  and  build  the 
Box  Tree  (Figure  2).  A  point  on  the  surface  of  a  straight 
homogeneous  generalized  cylinder  is  given  by  the  following 
equation. 

$$\vec{OP}(\theta, z) = r(z)\,\rho(\theta)\,(\cos\theta\,\vec{i} + \sin\theta\,\vec{j}) + z\,\vec{k}, \qquad (\theta, z) \in [0, 2\pi] \times [\ldots]$$

The function ρ(θ) specifies the shape of the cross section
of the SHGC, while r(z) is the scaling sweeping rule of
the SHGC. Notice that this formulation assumes that the
cross section is star shaped (Figure 4). In the sequel we
suppose that ρ(θ) and r(z) are strictly positive, C⁰, and
piecewise C¹.
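
As a small illustration, a surface point can be evaluated directly from the two functions, assuming the parameterization r(z) ρ(θ) (cos θ, sin θ) in the cross-section plane with z along the axis. The function name and the sample ρ and r below are hypothetical.

```python
import math

def shgc_point(theta, z, rho, r):
    """Point (x, y, z) of the SHGC surface for parameters theta and z.

    `rho` gives the cross-section radius as a function of theta and `r`
    the scaling sweeping rule as a function of z; both must be positive.
    """
    scale = r(z) * rho(theta)
    return (scale * math.cos(theta), scale * math.sin(theta), z)

# Hypothetical primitive: five-lobed star cross section, quadratic scaling.
star = lambda theta: 1.0 + 0.3 * math.cos(5.0 * theta)
sweep = lambda z: 1.0 + z * z
point = shgc_point(0.0, 0.5, star, sweep)
```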


Figure 4. A region is star shaped with respect to a point V iff, for each
point on its boundary, the whole segment between V and the point is included
inside the region. Convex regions form a strict subset of star shaped regions.


Definition: the sector of a straight homogeneous generalized
cylinder associated to a quadrant Q = [θ_1, θ_2] × [z_1, z_2]
is the set of points P(θ, z) such that (θ, z) ∈ Q.

For star shaped cross sections, it is straightforward that a
sector is a connected part of the generalized cylinder surface,
completely included within the planes z = z_1, z = z_2,
θ = θ_1 and θ = θ_2. Let A, B, C and D be the four
corners of the sector, i.e. the points defined respectively
by (θ_1, z_1), (θ_2, z_1), (θ_2, z_2) and (θ_1, z_2). It is
straightforward that these four points are in a same plane Π. Using
two planes parallel to Π we can build a box enclosing the
sector.

Figure 5. A sector and the associated box [...] the quadrilateral
(A, B, C, D).
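
The coplanarity of the four corners can be checked numerically. With the corners taken in the order (θ_1, z_1), (θ_2, z_1), (θ_2, z_2), (θ_1, z_2), which is an assumption of this sketch, the vectors AB and DC are both multiples of ρ(θ_2) u_2 - ρ(θ_1) u_1, where u_i = (cos θ_i, sin θ_i, 0), so ABCD is a planar trapezoid. The helper names are ours.

```python
import math

def corner(theta, z, rho, r):
    """Surface point for (theta, z), as in the SHGC parameterization."""
    s = r(z) * rho(theta)
    return (s * math.cos(theta), s * math.sin(theta), z)

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sector_corners(theta1, theta2, z1, z2, rho, r):
    """Corners A, B, C, D of the sector, in the assumed cyclic order."""
    return (corner(theta1, z1, rho, r), corner(theta2, z1, rho, r),
            corner(theta2, z2, rho, r), corner(theta1, z2, rho, r))

def coplanarity_residual(A, B, C, D):
    """Scalar triple product (AB x AD) . AC; zero for coplanar points."""
    return dot(cross(sub(B, A), sub(D, A)), sub(C, A))
```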

Definition: the box associated to a sector is the convex
polyhedron limited by the planes θ = θ_1, θ = θ_2,
z = z_1, z = z_2, and the two planes Π_1 and Π_2 parallel
to Π such that their distance from Π corresponds to the
negative minimum and positive maximum of the distance
of a sector point to Π.

The distance between the planes $\Pi_+$ and $\Pi_-$ is called
thickness of the box in the sequel. It characterizes the
closeness of the fitting box to the associated sector. It is
straightforward that a sector is entirely enclosed into the
associated box. This implies that if two boxes don't
intersect, then the associated sectors don't intersect either.
Conversely, if two sectors intersect, then the associated
boxes intersect too.

We now show how to compute $\Pi_+$ and $\Pi_-$, i.e. how to
compute the extrema of the (signed) distance $d$ between
a point $P$ and the plane $\Pi$. $d$ is given by

$$d(P, \Pi) = \frac{(P - P_0) \cdot \vec{n}}{|\vec{n}|}$$

where $P_0$ is a point in the plane and $\vec{n}$ is a (not necessarily
unit) vector orthogonal to $\Pi$. It is extremal at any
point where the numerator $e(P, \Pi)$ is extremal. We call $e$
the denormalized distance in the sequel. If we choose for
example $P_0 = A$ we get after some algebraic manipulation,
with $\rho_i = \rho(\theta_i)$ and $r_i = r(z_i)$,

$$\vec{n} = (z_2 - z_1)[\rho_2 \sin\theta_2 - \rho_1 \sin\theta_1]\,\vec{i}
 - (z_2 - z_1)[\rho_2 \cos\theta_2 - \rho_1 \cos\theta_1]\,\vec{j}
 - (r_2 - r_1)\rho_1\rho_2 \sin(\theta_2 - \theta_1)\,\vec{k}$$

and

$$e(P, \Pi) = r\rho(z_2 - z_1)[\rho_2 \sin(\theta_2 - \theta) - \rho_1 \sin(\theta_1 - \theta)]
 + \rho_1\rho_2 \sin(\theta_2 - \theta_1)[r_2(z_1 - z) - r_1(z_2 - z)]$$

We suppose from now on that both $\rho$ and $r$ are $C^1$ within
the sector. The local extrema of $d$ are located at the points
where both the partial derivatives of $e$ are 0. Let us write
these derivatives.

$$\frac{\partial e}{\partial \theta} = r\rho'(z_2 - z_1)[\rho_2 \sin(\theta_2 - \theta) - \rho_1 \sin(\theta_1 - \theta)]
 - r\rho(z_2 - z_1)[\rho_2 \cos(\theta_2 - \theta) - \rho_1 \cos(\theta_1 - \theta)]$$

$$\frac{\partial e}{\partial z} = r'\rho(z_2 - z_1)[\rho_2 \sin(\theta_2 - \theta) - \rho_1 \sin(\theta_1 - \theta)]
 - (r_2 - r_1)\rho_1\rho_2 \sin(\theta_2 - \theta_1)$$

As $r$ is strictly positive, the local extrema of $d$ are finally
given by the points $(\theta, z)$ which verify the two equations:

$$\rho'[\rho_2 \sin(\theta_2 - \theta) - \rho_1 \sin(\theta_1 - \theta)]
 = \rho[\rho_2 \cos(\theta_2 - \theta) - \rho_1 \cos(\theta_1 - \theta)] \tag{2}$$

$$r'\rho(z_2 - z_1)[\rho_2 \sin(\theta_2 - \theta) - \rho_1 \sin(\theta_1 - \theta)]
 = (r_2 - r_1)\rho_1\rho_2 \sin(\theta_2 - \theta_1) \tag{3}$$

Notice that Equation (2) is an equation in $\theta$ only. It can
be solved numerically on the interval $[\theta_1, \theta_2]$. Once the
solutions have been computed, the corresponding values of
$\theta$ can be substituted in (3), which becomes a one parameter
equation in $z$. It is solved the same way. If no extremum
is found inside the sector, we must look for extrema in one
of the planes $\theta = \theta_1$, $\theta = \theta_2$, $z = z_1$, and $z = z_2$. This
is done by substituting the right value of $z$ or $\theta$ in (2) or
(3), and solving this equation. Once the points for which
the distance is extremum are found, the actual distance is
computed from the denormalized distance by dividing it
by the norm of $\vec{n}$.
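As a rough illustration of how the thickness follows from the denormalized distance, the sketch below evaluates $e$ on a regular grid over the sector and divides the spread of its values by the norm of the normal vector. Grid sampling is a stand-in chosen here for brevity; it is not the root-finding on Equations (2) and (3) described above, and it only approximates the true extrema. The function names and the `steps` parameter are hypothetical.

```python
import math

def denormalized_distance(rho, r, t1, t2, z1, z2, t, z):
    """e(P, Pi) for the surface point with parameters (t, z) of the sector [t1, t2] x [z1, z2]."""
    p1, p2 = rho(t1), rho(t2)
    r1, r2 = r(z1), r(z2)
    bracket = p2 * math.sin(t2 - t) - p1 * math.sin(t1 - t)
    return (r(z) * rho(t) * (z2 - z1) * bracket
            + p1 * p2 * math.sin(t2 - t1) * (r2 * (z1 - z) - r1 * (z2 - z)))

def normal_norm(rho, r, t1, t2, z1, z2):
    """Norm of the (non unit) vector orthogonal to the plane through the four corners."""
    p1, p2 = rho(t1), rho(t2)
    r1, r2 = r(z1), r(z2)
    nx = (z2 - z1) * (p2 * math.sin(t2) - p1 * math.sin(t1))
    ny = -(z2 - z1) * (p2 * math.cos(t2) - p1 * math.cos(t1))
    nz = -(r2 - r1) * p1 * p2 * math.sin(t2 - t1)
    return math.sqrt(nx * nx + ny * ny + nz * nz)

def box_thickness(rho, r, t1, t2, z1, z2, steps=64):
    """Grid-sampled estimate of the distance between the two bounding planes."""
    values = [
        denormalized_distance(rho, r, t1, t2, z1, z2,
                              t1 + (t2 - t1) * i / steps,
                              z1 + (z2 - z1) * j / steps)
        for i in range(steps + 1)
        for j in range(steps + 1)
    ]
    return (max(values) - min(values)) / normal_norm(rho, r, t1, t2, z1, z2)

def unit(_):
    """Constant sweeping rule, used to describe a circular cylinder of radius 1."""
    return 1.0
```

For a quarter of a unit circular cylinder the estimate reduces to the sagitta of the arc, 1 - cos(pi/4), which gives a quick sanity check of the formulas.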

We  can  now  define  the  Box  Tree  as  a  hierarchy  of  boxes. 

Definition: the Box Tree (Figure 2) associated to a quadrant
$Q_0$ of a straight homogeneous generalized cylinder is
a tree of degree 4 where each node $N$ is associated to a
subquadrant $Q$ of $Q_0$ in parameter space and contains a
pointer to the associated box $B$. The root is associated
to $Q_0$. If a node $N$ is not a leaf, it has four sons that
correspond to the four subquadrants of $Q$. The leaves of
the tree contain a pointer to the associated sectors.
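A minimal sketch of such a node is given below, assuming that the four subquadrants are obtained by halving both parameter intervals. The class and field names are hypothetical, and the pointer to the box is omitted to keep the example short. Sons are created on demand and only once, which mirrors the adaptive construction of the tree.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoxTreeNode:
    theta: Tuple[float, float]            # angular interval of the subquadrant
    z: Tuple[float, float]                # axial interval of the subquadrant
    sons: List["BoxTreeNode"] = field(default_factory=list)

    def subdivide(self) -> List["BoxTreeNode"]:
        """Build the four sons if they do not exist yet, then return them."""
        if not self.sons:
            t1, t2 = self.theta
            z1, z2 = self.z
            tm, zm = 0.5 * (t1 + t2), 0.5 * (z1 + z2)
            self.sons = [
                BoxTreeNode((ta, tb), (za, zb))
                for (ta, tb) in ((t1, tm), (tm, t2))
                for (za, zb) in ((z1, zm), (zm, z2))
            ]
        return self.sons

    def is_leaf(self) -> bool:
        return not self.sons
```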

The easiest way to build the Box Tree associated to the
surface of a SHGC as defined in (1) would be to associate
the root of a Box Tree to the quadrant $Q_0 = [0, 2\pi] \times
[z_-, z_+]$ and then subdivide it normally. Unfortunately
this is impossible, as the box associated to a sector is
defined only if $\theta_2 - \theta_1$ is not $\pi$ or $2\pi$ (otherwise the planes
$\theta = \theta_1$, $\theta = \theta_2$ and $\Pi$ are parallel). So we choose to
associate only three sons to the root of the Box Tree $BT_0$
corresponding to $Q_0$. They are associated to the three
quadrants $[0, 2\pi/3] \times [z_-, z_+]$, $[2\pi/3, 4\pi/3] \times [z_-, z_+]$, and
$[4\pi/3, 2\pi] \times [z_-, z_+]$. In the case where $\rho$ or $r$ are only
piecewise $C^1$, the root points to as many sons as there are
quadrants where $\rho$ and $r$ are both $C^1$.

Figure 6. A polynomial SHGC, a cylinder, and their intersection curves.

Each of the ends of a SHGC is represented by the one
dimensional equivalent of the Box Tree, a tree of degree 2
obtained by recursive subdivisions of $\theta$ intervals. Let $BT_1$
and $BT_2$ be these two trees; the root of the Box Tree $BT$
associated to a SHGC points to $BT_0$, $BT_1$, and $BT_2$. A
simple enclosing box (sphere and discs) is associated to
$BT$ and each $BT_i$. In the complexity analyses of Sections
3 and 4, we will suppose that Box Tree nodes are always
subdivided homogeneously into four subnodes. We have
just seen that this is only an approximation at the root
level of the tree, but it doesn't change the overall
complexity of the algorithms.

Carlson [4] and Ponce and Faugeras [18] have used very
similar representations for solids bounded by parametric
patches  [5]  and  polyhedra.  In  these  representations  how¬ 
ever,  the  total  number  of  nodes  at  the  leaf  level  is  fixed  in 
advance.  It  is  simply  the  total  number  of  faces  (paramet¬ 
ric  patches  or  planar  polygons).  In  our  case,  the  SHGC  is 
entirely  specified  by  one  analytic  expression,  and  we  don’t 
have  to  subdivide  its  surface  uniformly  into  sectors.  In¬ 
stead,  the  surface  is  subdivided  adaptively,  and  the  Box 
Tree  is  built  at  the  same  time,  during  the  execution  of  the 
intersection  algorithms.  This  is  especially  important  for 
set  operations  (see  Section  3). 

The  Box  Tree  is  an  exact  representation:  it  follows  from 
the  fact  that  the  sectors  associated  to  the  leaves  of  the  tree 
partition  exactly  the  surface  of  the  SHGC.  No  approxima¬ 
tion  is  made  during  the  localization  step  of  the  Box  Tree 
algorithms.  Approximations  only  occur  during  the  (rela¬ 
tively  cheap)  second  steps  of  both  set  operation  and  ray 
tracing algorithms (see Sections 3 and 4). In the current
implementation,  the  sectors  are  then  replaced  by  approx¬ 
imating  planar  faces  (the  quadrilaterals  ABCD).  Some 
numerical  scheme  could  be  used  instead,  so  this  step  could 
be  tuned  to  achieve  a  given  accuracy  without  recomputing 
the localization step. This is an advantage of the Box Tree
over  homogeneous  decompositions,  where  the  resolution  is 
fixed  in  advance. 




A  last  remark  about  the  Box  Tree:  it  is  the  world  space 
counterpart  of  a  Quadtree  [8], [21], [22], [24]  in  parameter 
space,  “enriched”  by  some  geometric  information.  This 
allows  us  to  use  the  classical  analysis  of  Quadtree  al¬ 
gorithms  to  get  an  estimate  of  the  Box  Tree  algorithms 
complexity.  This  will  prove  especially  useful  for  set  opera¬ 
tions  (see  next  Section).  Notice  however  that  although  the 
Box  Tree  is  similar  to  the  Quadtree  in  parameter  space, 
it  is  very  different  from  the  Octree  related  data  structures 
[9], [14]  which  generalize  the  Quadtree  to  3D  space.  The 
Octree  directly  represents  the  volume  of  an  object  and  is 
attached  to  a  given  reference  frame,  while  the  Box  Tree  is 
a  boundary  representation,  and  is  attached  to  the  object 
itself  (in  world  space). 

3.  The  set  operation  algorithm 

We  follow  the  basic  algorithm  described  in  Section  1.  The 
only  difference  with  homogeneous  decompositions  is  in  the 
first  step,  finding  the  intersecting  faces.  The  algorithm  is 
a  direct  generalization  of  the  Strip  Tree  [1]  and  Prism  Tree 
[18]  algorithms.  The  two  trees  are  visited  in  parallel.  At 
a  given  level  of  the  recursion,  two  nodes  are  compared. 
If  their  boxes  don’t  intersect,  then  the  associated  sectors 
don’t  intersect.  The  recursion  stops.  If  the  boxes  intersect 
and  are  thicker  than  a  given  threshold  then  the  thickest 
node  is  subdivided  (to  ensure  that  boxes  of  equivalent  sizes 
are always compared, see [1]), and the recursion proceeds.
In  the  remaining  case,  the  nodes  are  not  subdivided.  They 
are  leaves,  and  the  associated  sectors  are  then  compared 
directly.  If  they  intersect,  the  two  nodes  and  all  their  an¬ 
cestors  are  marked  as  intersecting.  During  the  execution 
of  the  algorithm,  for  each  node,  a  list  of  all  nodes  inter¬ 
secting  it  is  maintained. 


Figure 7. The Box Tree of the polynomial SHGC of Figure 6 after the
computation of the intersection curves. The black squares are intersecting
leaves.

We  now  give  a  procedural  version  of  the  algorithm.  We 
assume  we  have  the  following  functions: 

Boxes-Intersect(N1, N2) returns true iff the boxes associated
to the nodes N1 and N2 intersect. Sectors-Intersect(N1, N2)
returns true iff the sectors associated to the nodes
N1 and N2 intersect. This function in particular
depends on the implementation: it may test directly planar
approximations  to  the  actual  surfaces  (as  in  the  current 


implementation),  or  use  a  numerical  scheme  to  compute 
the  actual  surface  intersection  between  the  sectors.  The 
procedure  Subdivide(N)  builds  the  sons  of  the  node  N 
after  having  checked  that  these  sons  don’t  already  exist. 
Notice  that  all  these  functions  can  be  executed  in  essen¬ 
tially  constant  time. 

Procedure Find-Box-Trees-Intersecting-Nodes (N1, N2);
begin
  if Boxes-Intersect (N1, N2) then
  begin
    if (N1↑.thickness < ε) and (N2↑.thickness < ε)
    then begin (* Leaves: compare the sectors directly *)
      if Sectors-Intersect (N1, N2)
      then begin
        N1↑.Mark ← Intersection; N2↑.Mark ← Intersection;
        N1↑.Intersecting-Nodes ← N1↑.Intersecting-Nodes + N2;
        N2↑.Intersecting-Nodes ← N2↑.Intersecting-Nodes + N1
      end
    end
    else (* The recursion proceeds *)
      if N1↑.thickness > N2↑.thickness
      then begin
        Subdivide(N1);
        for S1 in N1↑.Sons do
        begin
          Find-Box-Trees-Intersecting-Nodes (S1, N2);
          if (S1↑.Mark = Intersection)
          then N1↑.Mark ← Intersection
        end
      end
      else begin
        Subdivide(N2);
        for S2 in N2↑.Sons do
        begin
          Find-Box-Trees-Intersecting-Nodes (N1, S2);
          if (S2↑.Mark = Intersection)
          then N2↑.Mark ← Intersection
        end
      end
  end
end;
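The recursion can be exercised on a toy problem. In the sketch below the geometry is deliberately replaced by a stand-in: each node covers a one-dimensional interval, the "box" is the interval itself, its "thickness" is its length, and two "sectors" intersect when their intervals overlap. Only the control flow (prune on disjoint boxes, subdivide the thicker node, compare leaves directly, mark ancestors) follows the three cases described in the text; all names are hypothetical.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.sons = []
        self.mark = False          # True once an intersecting leaf lies below this node
        self.intersecting = []     # leaves of the other tree found to intersect this leaf

    @property
    def thickness(self):
        return self.hi - self.lo

    def subdivide(self):
        """Build the two sons unless they already exist."""
        if not self.sons:
            mid = 0.5 * (self.lo + self.hi)
            self.sons = [Node(self.lo, mid), Node(mid, self.hi)]
        return self.sons

def boxes_intersect(a, b):
    return a.lo < b.hi and b.lo < a.hi

def sectors_intersect(a, b):
    # Stand-in for the direct sector test; here it coincides with the box test.
    return boxes_intersect(a, b)

def find_intersecting_nodes(n1, n2, eps):
    if not boxes_intersect(n1, n2):
        return                                    # recursion stops
    if n1.thickness < eps and n2.thickness < eps:
        if sectors_intersect(n1, n2):             # leaves: compare directly
            n1.mark = n2.mark = True
            n1.intersecting.append(n2)
            n2.intersecting.append(n1)
        return
    thick, other_first = (n1, False) if n1.thickness > n2.thickness else (n2, True)
    for son in thick.subdivide():                 # subdivide the thickest node
        if other_first:
            find_intersecting_nodes(n1, son, eps)
        else:
            find_intersecting_nodes(son, n2, eps)
        if son.mark:
            thick.mark = True                     # propagate the mark to the ancestor

def marked_leaves(node):
    if not node.sons:
        return [(node.lo, node.hi)] if node.mark else []
    return [leaf for son in node.sons for leaf in marked_leaves(son)]
```

Both trees are refined only near the overlap, so the part of each interval far from the other one is never subdivided.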

Figure  6  shows  an  example  of  the  application  of  the  al¬ 
gorithm  to  the  computation  of  the  intersection  curves  of 
a  SHGC  with  a  polynomial  sweeping  rule  and  a  cylinder. 
Figure  7  shows  the  Box  Tree  associated  to  the  polynomial 
SHGC,  and  Figure  8  shows  the  set  operations  between 
these  two  objects.  Finally  Figure  9  shows  an  example  of 
set  operations  between  a  SHGC  with  edges  and  two  cylin¬ 
ders.  We  now  use  the  analogy  between  the  Box  Tree  and 
the  Quadtree  and  the  classical  analysis  of  [8]  for  Quadtree 
algorithms  to  get  the  complexity  of  the  set  operation  al¬ 
gorithm. 

Hunter and Steiglitz [8] describe an algorithm for computing
the Quadtree for a polygon, i.e. a Quadtree such
that only nodes intersecting the boundary of a polygon
have children, and all nodes are colored either inside,
outside, or intersecting. In our case, after having computed
the intersection curves, the Box Tree can be condensed.
In other terms, nodes all of whose children are non
intersecting nodes can be marked as non intersecting (this
additional step is required because two boxes can intersect
although the corresponding sectors don't intersect, so two
intersecting nodes may have only non intersecting
children). After classification of the non intersecting nodes,
the Box Tree is then exactly equivalent in parameter space
to the Quadtree for the polygon of the intersection curves.

The analysis of Hunter and Steiglitz for Quadtrees for
polygons applies. They show that the total number of
nodes in the tree is $O(p \cdot 2^q)$ where $p$ is the perimeter of
the intersection curve(s), and $q$ is the depth of the tree. In
fact, here $p$ is the perimeter in parameter space. For well
behaved surfaces, the perimeter in world space is proportional
to the perimeter in parameter space, and the complexity
in world space stays the same. The actual number
of nodes built during the set operations algorithm may
in fact be slightly higher, as some nodes have to be condensed.
Figure 10 shows on an example that the total
number of nodes is in fact approximately a linear function
of $2^q$, as predicted. The total number of intersecting
sectors is bounded by the total number of nodes. This
implies that the numbers of intersecting sectors in the two
trees are both $O(p \cdot 2^q)$.

The number of nodes in each tree gives a lower bound
to the complexity of the set operation algorithm. Once
again, we can use Hunter's and Steiglitz's analysis to get
an idea of the complexity of the localization step. They
show that the subdivision of the intersecting nodes (analogous
to our localization step) also has a $O(p \cdot 2^q)$ complexity.
The localization step can be slightly more expensive
than the subdivision step for Quadtrees, as a node may
intersect several other nodes, and some intersecting boxes
don't actually correspond to sectors intersections. If we
assume most of them do, the localization step is also $O(p \cdot 2^q)$.
Figure 11 shows on an example that the number of
intersections follows approximately this law.

Figure 9. An example of set operations between a SHGC with edges and
cylinders. The resulting solid is displayed by using a z-buffer algorithm to
render an approximation of the sectors by planar faces.

Our analysis shows that the total complexity of the set
operation algorithms is approximately $O(p \cdot 2^q)$. Notice
that if we had used a homogeneous decomposition of
the surface, the total number of sectors per object would
have been $O(2^{2q})$, so the complexity of our algorithm is
$O(\sqrt{n})$, where $n$ is the equivalent number of faces for a
homogeneous decomposition. The computing time on a
lisp machine is 2 minutes for the complete set operation
algorithm. Notice that the number of intersections at the
deepest level of the tree is only 9301, although the total
number of faces for each tree would be more than 25000
if a homogeneous decomposition was used.
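The square-root relation follows from a short count. The lines below assume, as in the complexity analyses, that every node has exactly four sons, and write $q$ for the depth of the tree and $n$ for the number of leaf sectors of a homogeneous decomposition; constant factors are ignored.

```latex
\begin{align*}
  n &= 4^{q} = 2^{2q}
     && \text{(homogeneous decomposition, all leaves at depth } q\text{)} \\
  2^{q} &= \sqrt{n}
     && \text{(leaves met along a curve crossing the parameter domain)} \\
  O(p \cdot 2^{q}) &= O(p \sqrt{n})
     && \text{(cost of the adaptive algorithm, perimeter } p \text{ fixed)}
\end{align*}
```

For instance, at depth $q = 7$ a homogeneous decomposition would hold $4^7 = 16384$ leaf sectors for one quadrant, whereas the adaptive count grows like $2^7 = 128$ times the perimeter factor.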


Figure 10. The total number of nodes in the Box Tree of the polynomial
SHGC of Figure 6, drawn as a function of $2^q$.


4.  The  ray  tracing  algorithm 

The  basic  algorithm  is  analogous  to  the  algorithms  for 
fractals  [10]  and  Prism  Trees  [18]:  at  a  given  level,  test 
the  ray  against  a  node.  If  it  does  not  intersect  its  box,  the 
recursion  stops.  Otherwise,  the  recursion  proceeds.  Test 
the ray against the four sons of the node, and update a
sorted  active  list  of  intersecting  nodes.  At  the  leaf  level, 
compare  directly  the  sectors  of  the  active  list  to  the  ray, 




Figure 11. The number of intersection tests for the set operations between
the SHGC and the cylinder of Figure 6, drawn as a function of $2^q$.

the  visible  point  corresponds  to  the  first  intersection.  In 
fact  this  algorithm  can  be  improved  by  observing  that, 
as  the  boxes  associated  to  the  sons  of  a  node  are  convex 
and  disjoint,  they  can  always  be  sorted  from  front  to  back 
along  the  ray.  The  trick  of  the  algorithm  is  to  explore  the 
tree  in  a  depth  first  manner,  always  visiting  the  sons  of  a 
node  from  front  to  back.  This  way,  the  first  intersection 
found  is  guaranteed  to  be  the  good  one.  No  active  list 
is  needed,  and  in  the  ideal  case  only  one  branch  of  the 
tree  is  visited.  Again,  we  give  a  procedural  version  of  the 
algorithm.  We  assume  we  have  the  following  functions: 
Sort-Intersections(Ray,  N)  computes  the  intersections  of 
the  ray  Ray  with  the  boxes  associated  to  the  sons  of  N  and 
returns  the  list  of  those  intersections,  sorted  in  increasing 
depth  order.  Sector-Raycast(Ray,  N)  returns  the  inter¬ 
section  of  Ray  and  the  sector  associated  to  N  (or  nil  if  no 
such  intersection  exists).  Again,  this  function  may  either 
compute  the  intersection  of  the  ray  and  an  approximat¬ 
ing  planar  face,  or  compute  the  actual  intersection.  In 
the  current  implementation,  it  computes  the  intersection 
with  the  approximation,  but  the  normal  at  the  point  is 
interpolated  to  yield  a  more  precise  value  of  the  intensity. 

Function Box-Tree-Raycast (Ray, Node) : point;
begin
  Intersection-Point ← nil;
  if Node↑.thickness < ε
  then (* Test ray against sector. Stop recursion *)
    Intersection-Point ← Sector-Raycast (Ray, Node)
  else begin (* Proceed. Test and sort sons. *)
    Subdivide(Node);
    Sorted-Sons-List ← Sort-Intersections (Ray, Node↑.Sons);
    Head-List ← Sorted-Sons-List;
    while (Head-List ≠ nil) and (not Intersection-Point)
    do begin
      Intersection-Point ← Box-Tree-Raycast (Ray, Head-List↑.Node);
      Head-List ← Head-List↑.next
    end
  end;
  Box-Tree-Raycast ← Intersection-Point
end;
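The front-to-back, depth-first traversal can be illustrated on a one-dimensional stand-in, sketched below under simplifying assumptions: nodes cover intervals of a line, the ray travels along that line in direction +1 or -1, and a leaf "sector" is hit when it contains one of a given list of obstacle positions. None of these names come from the paper. The `visited` list records which nodes were examined, to show that the search stops at the first hit.

```python
def sector_raycast(origin, direction, lo, hi, obstacles):
    """Stand-in leaf test: nearest obstacle in [lo, hi] lying ahead of the ray, or None."""
    ahead = [x for x in obstacles if lo <= x <= hi and (x - origin) * direction >= 0]
    return min(ahead, key=lambda x: abs(x - origin)) if ahead else None

def entry_distance(origin, direction, lo, hi):
    """Distance at which the ray enters [lo, hi], or None if the interval is behind it."""
    near, far = (lo, hi) if direction > 0 else (hi, lo)
    if (far - origin) * direction < 0:
        return None
    return max(0.0, (near - origin) * direction)

def box_tree_raycast(origin, direction, lo, hi, obstacles, eps, visited):
    visited.append((lo, hi))
    if hi - lo < eps:                      # thin node: test the sector, stop recursion
        return sector_raycast(origin, direction, lo, hi, obstacles)
    mid = 0.5 * (lo + hi)
    sons = []
    for a, b in ((lo, mid), (mid, hi)):    # keep the sons met by the ray
        d = entry_distance(origin, direction, a, b)
        if d is not None:
            sons.append((d, a, b))
    sons.sort()                            # front to back along the ray
    for _, a, b in sons:
        hit = box_tree_raycast(origin, direction, a, b, obstacles, eps, visited)
        if hit is not None:
            return hit                     # first intersection found is the visible one
    return None
```

Because the sons are disjoint and visited in order of increasing entry distance, any hit found in a nearer son precedes every possible hit in a farther one, which is why no active list is required.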


Figure 12. The ray traced image of a scene made of SHGC's.

Let  us  again  try  to  analyze  the  complexity  of  the  algo¬ 
rithm.  In  the  ideal  case  the  first  branch  of  the  tree  visited 
leads  to  an  intersecting  sector.  The  complexity  in  this 
case is proportional to the depth $q$ of the tree for each
ray. Equivalently, it is $O(\log n)$ where $n$ is the number of
faces  of  the  equivalent  homogeneous  decomposition.  This 
time,  the  complexity  of  the  construction  of  the  tree  is  dif¬ 
ferent,  as  it  is  likely  that  all  the  surface  of  the  SHGC  has 
to  be  subdivided  up  to  the  same  level.  This  leads  to  a 
linear  complexity  in  the  number  of  faces,  but  the  tree  is 
computed  just  one  time  for  all  the  image  rendering. 

Figure  12  shows  the  shaded  ray  traced  image  of  a  scene 
made  of  SHGC’s.  Notice  the  shadows.  This  image  has 
been  computed  at  a  512  x  512  resolution,  with  anti  alias¬ 
ing  at  the  boundary  between  objects  (using  a  double  res¬ 
olution  there).  The  maximum  depth  of  the  tree  is  6,  so 
the  number  of  equivalent  polygons  is  around  40000.  The 
computing  time  on  a  lisp  machine  is  3  hours.  Figure  13 
shows  the  average  number  of  intersection  tests  per  ray 
traced  as  a  function  of  resolution  for  the  example  of  Fig¬ 
ure  12.  In  this  case,  the  shadows  are  not  computed.  The 
figure  shows  that  the  number  of  intersections  is  roughly 
proportional  to  the  depth  of  the  tree,  as  predicted  by  the 
analysis. 

Finally,  Figure  3  shows  the  limbs  and  intersection  curves 
of  the  union  of  a  polynomial  SHGC  and  a  cylinder,  ren¬ 
dered  by  using  ray  tracing  at  the  limbs  and  intersection 
curves  only.  The  limbs  are  computed  by  using  the  method 
described in [16]. The advantage of this method is its
efficiency: the ray tracing is computed only on a few curves,
instead  of  all  the  image  pixels.  It  is  particularly  suited  for 
line  drawings,  with  a  CPU  time  of  about  5  minutes  on  a 
lisp  machine. 

Conclusion 

We have presented the Box Tree, a new hierarchical
representation for straight homogeneous generalized cylinders
and have used it to efficiently compute set operations and
ray tracing. The bulk of this paper was dedicated to
graphics problems. The modelling system presented here,
however, has been developed in the context of vision. It
will be used to predict observable features in the images
of the modelled objects. For example we have shown
elsewhere [16], [17] how the model of a SHGC could be used to
predict its ribbons and find its axis in real images.
Moreover, when several primitives are combined through set
operations, the intersection curves between the surfaces
of the primitives generate observable intensity
discontinuities in images. The set operation algorithm described in
Section 3 can be used to compute these curves and
predict the properties of their projections in images. Finally,
we believe that a wider class of generalized cylinders and
joining operations are necessary to model classes of real
objects. Our next step will be to investigate the
representation of other primitives (for example generalized
cylinders whose spine is a space curve, or whose cross section is
not orthogonal to the axis, see [17]) and joining operations
(like blended surfaces and articulations).

Figure 13. The average number of intersection tests per ray for the scene
of Figure 12, drawn as a function of $q$.

References

[1] Ballard, D.H., Strip Trees: A hierarchical representation
for curves, Comm. of the ACM, Vol 24, No 5
(May 1981), pp. 310-321.

[2] Binford, T.O., Visual perception by computer, Proc.
IEEE Conf. on Systems and Control, Miami (December
1971).

[3] Brooks, R.A., Symbolic reasoning among 3D models
and 2D images, Artificial Intelligence 17 (1981), pp.
285-348.

[4] Carlson, W.E., An algorithm and data structure for
3D object synthesis using surface patch intersections,
Comp. Gr. SIGGRAPH 82 Conf. Proc. 16 (1982),
3, pp. 255-264.

[5] Catmull, E., A subdivision algorithm for computer
display of curved surfaces, University of Utah, Salt
Lake City, 1974.

[6] Fournier, A., Fussell, D., and Carpenter, L., Computer
rendering of stochastic models, Comm. of the
ACM, Vol 25, No 6 (June 1982), pp. 371-384.

[7] Hoffmann, C., and Hopcroft, J., Automatic surface
generation in Computer Aided Design, TR 85-661,
Dept. of Comp. Science, Cornell University, 1985.

[8] Hunter, G.M., and Steiglitz, K., Operations on images
using quad trees, IEEE Trans. Patt. Analysis
and Machine Intell., Vol PAMI-1, No 2 (1979).

[9] Jackins, C.L., and Tanimoto, S.L., Oct-trees and their
use in representing 3D objects, Dept. of Comp. Science,
Univ. of Washington, Seattle, Techn. Rep.
79-07-06 (1980).

[10] Kajiya, J.T., New techniques for ray tracing procedurally
defined objects, Proc. SIGGRAPH 83 (July
1983), pp. 91-102.

[11] Kalay, Y.E., Determining the spatial containment
of a point in general polyhedra, Computer Vision,
Graphics, and Image Processing 19, pp. 303-334,
1982.

[12] Mantyla, M., and Tamminen, M., Localized set operations
for solid modelling, Proc. SIGGRAPH 83
(July 1983), pp. 279-288.

[13] Marr, D., and Nishihara, K., Representation and recognition
of the spatial organization of three dimensional
shapes, Proc. Royal Soc. of London, B-200,
pp. 269-294 (1977).

[14] Meagher, D., Geometric modelling using octree encoding,
Comp. Gr. Image Pr., Vol 19 (1982), pp.
129-147.

[15] Nevatia, R., Machine perception, Prentice-Hall,
Englewood Cliffs, NJ, 1982.

[16] Ponce, J., and Chelberg, D., Finding the limbs and
cusps of generalized cylinders, Proc. of the 4th IEEE
International Conference on Robotics and Automation,
April 1987.

[17] Ponce, J., Chelberg, D., and Mann, W., Invariant
properties of the projections of straight homogeneous
generalized cylinders, submitted to the First International
Conference on Computer Vision, 1987.

[18] Ponce, J., and Faugeras, O., An object centered hierarchical
representation for 3D objects: the Prism
Tree, Comp. Vis. Gr. Image Pr., 1987 (to appear).

[19] Requicha, A.A.G., and Voelcker, H.B., Solid modeling:
a historical summary and contemporary assessment,
IEEE Comp. Graphics and Appl., Vol 2, No 2
(1982), pp. 9-24.

[20] Roth, S.D., Ray Casting for modelling solids, Comp.
Gr. Image Pr., Vol 18 (1982), pp. 109-144.

[21] Samet, H., Neighbor finding techniques for images
represented by Quadtrees, Comp. Gr. Image Pr.,
Vol 18 (1982).



[22]  Samet,  H.,  The  Quadtree  and  related  hierarchical 
data  structures,  Computing  Surveys,  Vol.  16,  No  2, 
pp.  187-260  (1984). 

[23] Shafer, S.A., and Kanade, T., The theory of straight
homogeneous generalized cylinders, Carnegie Mellon
Univ., Comp. Sc. Dept. Tech. Rep. CMU-CS-83-105
(1983).

[24]  Tanimoto,  S.,  and  Pavlidis,  T.,  A  hierarchical  data 
structure  for  picture  processing,  Comp.  Gr.  Image 
Pr.,  Vol  4,  No  2  (1975). 

[25] Tilove, R.B., Set membership classification: a unified
approach to geometric intersections problems, IEEE
Trans. on Comp., C-29 (10) (1980), pp. 874-883.

[26]  Whitted,  T.,  An  improved  illumination  model  for 
shaded  display,  CACM,  Vol.  23,  No  6  (1980),  pp. 
343-349. 




Line-Drawing Interpretation:
Straight-Lines and Conic-Sections

Vic Nalwa

A.I. Lab., Stanford University, CA 94305


Abstract 

Line-drawings corresponding to orthographically
projected images of man-made scenes often
exhibit instances of straight-lines and conic-sections,
i.e., ellipses, parabolas, and hyperbolas.
We investigate constraints imposed on the scene by
such instances under the assumption of general
viewpoint, i.e., under the assumption that the
properties of the mapping of the viewed surface onto
the line-drawing are stable under perturbation of the
viewpoint within some open set on the Gaussian
sphere. The viewed surfaces are assumed to be
piecewise $C^3$. The main results of this paper are
that straight-lines, ellipses, parabolas, and hyperbolas
in line-drawings are orthographic projections of
scene-edges which are also straight-lines, ellipses,
parabolas, and hyperbolas, respectively. Further,
continuous-surface-tangent depth discontinuities
which orthographically project onto straight-lines
can be described locally by developable surfaces, and
those which orthographically project onto conic-sections
can be described locally by quadric surfaces.
It is also shown that scene events which
orthographically project onto straight-lines or
conic-sections cannot be combinations of
viewpoint-independent and viewpoint-dependent
edges.

I.  Introduction 

Line-drawings  are  representations  of  edges  in 
an  image.  An  edge  in  an  image  may  be  caused  by 
various  events  in  the  scene.  The  scene-event  may 
be  a  surface-tangent  discontinuity,  a  depth  discon¬ 
tinuity,  a  surface-reflectance  discontinuity  or  an 
illumination  discontinuity  (shadow).  Fig.  1  illus¬ 
trates  the  various  cases.  Notice  that  a  depth 


discontinuity  may  also  simultaneously  be  a 
surface-tangent  discontinuity.  We  distinguish  this 
from  a  continuous-surface-tangent  depth  discon¬ 
tinuity.  While  surface-tangent  discontinuities, 
illumination  discontinuities,  and  surface-reflectance 
discontinuities  constitute  viewpoint-independent 
scene-edges,  continuous-surface-tangent  depth 
discontinuities  are  viewpoint-dependent. 

Although  the  depth  information  is  lost  under 
projection,  we  can  nevertheless  impose  some  con¬ 
straints  on  the  3-D  scene  by  investigating  the  2-D 
configuration  of  the  corresponding  line-drawing. 
We  restrict  our  attention  to  orthographic  projection 
here.  Orthographic  projection  is  a  reasonable 
approximation  to  perspective  projection  whenever 
the  angle  subtended  by  the  object  at  the  viewpoint 
is  small.  Orthographically  projected  line-drawings 
of  man-made  scenes  often  exhibit  instances  of 
straight-lines  and  conic-sections,  i.e.,  ellipses,  para¬ 
bolas,  and  hyperbolas.  We  investigate  constraints 
imposed  on  the  scene  by  such  instances  under  the 
assumption  of  general  viewpoint,  i.e.,  under  the 
assumption  that  the  mapping  of  the  viewed  surface 
onto  the  line-drawing  is  stable  under  perturbation 
of  the  viewpoint  within  some  open  set  on  the  Gaus¬ 
sian  sphere1.  The  stability  we  demand  includes  the 
requirement  that  the  categorization  of  the  curve  in 
the  line-drawing  as  a  straight-line,  ellipse,  parabola, 
or  hyperbola,  be  preserved  under  perturbation  of 
viewpoint.  Other  properties  of  the  mapping  with 
respect  to  which  we  require  stability  will  be  men¬ 
tioned  as  we  proceed.  As  long  as  the  total  number 

1 The Gaussian sphere is a device to represent orientation
in space. Each point on a unit sphere, called the
Gaussian sphere (see [Hilbert and Cohn-Vossen, 1952]),
corresponds to the unit vector from the center of the
sphere to that point.




of  such  requirements  is  finite,  the  set  of  viewing 
directions  on  the  Gaussian  sphere  where  the  general 
viewpoint  assumption  is  invalid  is  closed  and  has 
measure  zero2.  It  should  be  noted  that  given  the 
loss  of  information  about  the  third  dimension  under 
projection,  one  cannot  hope  to  make  any  substan¬ 
tial  deductions  about  the  imaged  surfaces  without 
the  general  viewpoint  assumption. 

It is easy to see that the only space curves
which orthographically project, under general
viewpoint, to straight-lines, ellipses, parabolas, or
hyperbolas are those which belong to the
corresponding family. This observation, however,
pertains only to all viewpoint-independent scene-edges.
The situation is more subtle for continuous-surface-tangent
depth discontinuities where the
viewing direction is tangential to the surface. To
illustrate this, we present in Fig. 2 an instance of a
$C^1$ curve in a line-drawing which is the orthographic
projection of a $C^0$ curve in space. The surface
in the figure consists of a hemisphere smoothly
joined to one end of a right circular cylinder.

We examine lines resulting from depth discontinuities here. All
viewpoint-independent edges can be included in the following
analysis by assuming, without any loss of generality, all
non-depth discontinuities to be replaced by
discontinuous-surface-tangent depth discontinuities. Such a
replacement retains the viewpoint-independent character of the
scene-edges while making them amenable to analysis as depth
discontinuities. It will turn out that combinations of
viewpoint-independent and viewpoint-dependent scene-edges cannot
orthographically project to straight-lines or conic-sections
under general viewpoint. Pairs of straight-lines are exceptions
to this rule. These are degenerate conic-sections which may have
one of the two lines viewpoint-independent while the other is
viewpoint-dependent. Throughout our discussion we will treat
pairs of straight-lines as instances of straight-lines rather
than of conics.

² An arbitrary set in any manifold has measure zero if its
preimage, for every local parametrization, can be covered by a
countable number of rectangular solids with an arbitrarily small
total volume in Euclidean space. For our purposes it suffices to
note that the Gaussian sphere is a two dimensional manifold and
consequently all sets of points and lines on it have measure
zero. The null set of course has measure zero. On the other hand,
no non-empty open set on the Gaussian sphere has measure zero.

It can be shown that the union of a countable number of sets of
measure zero is also of measure zero (see [Guillemin and Pollack,
1974]). Therefore, the union of a finite number of closed sets of
measure zero is closed and of measure zero. The complement of
such a union is open and of full measure.

II.  Preview  of  Results 

We summarize here the results of this paper. It is assumed that
the projected surfaces are piecewise $C^3$, i.e., the surfaces
comprise finitely many $C^3$ patches, each bounded by a finite
piecewise $C^3$ curve. This restriction does not significantly
constrain the domain as long as we do not require a priori
knowledge of the $C^3$ surface-patch boundaries. The other
assumptions are explicitly stated in the theorems. The
straight-lines and conic-sections in the line-drawing need not be
continuous for the theorems to be valid, i.e., the constraints
imposed on the viewed surface hold even if the curves in the
line-drawing are fragmented.

Theorem 1 : Straight-lines in a line-drawing obtained under
orthographic projection with a general viewpoint are projections
of straight-lines in space.

Theorem 2 : Ellipses, parabolas and hyperbolas in a line-drawing
obtained under orthographic projection with a general viewpoint
are projections of ellipses, parabolas and hyperbolas,
respectively, in space.

Theorem 3 : Circles in a line-drawing obtained under orthographic
projection with a general viewpoint are projections of circles in
space which are confined to lie in planes parallel to the image
plane.

Theorem 4 : Scene events which orthographically project, under
general viewpoint, onto straight-lines or conic-sections in
line-drawings must either be completely viewpoint-independent or
completely viewpoint-dependent.

Theorem 5 : Continuous-surface-tangent depth discontinuities
which orthographically project, under general viewpoint, onto
straight-lines in line-drawings can be described locally by
developable surfaces.

Theorem 6 : Continuous-surface-tangent depth discontinuities
which orthographically project, under general viewpoint, onto
conic-sections in line-drawings can be described locally by
quadric surfaces. Specifically, continuous-surface-tangent depth
discontinuities projecting onto ellipses can be described by
ellipsoids or hyperboloids of one sheet, those projecting onto
parabolas by elliptic paraboloids or hyperbolic paraboloids, and
those projecting onto hyperbolas by hyperboloids of one sheet or
hyperboloids of two sheets. Furthermore, these quadric surfaces
are completely determined, up to three degrees of freedom, by the
conic-sections in the line-drawings.

Theorem 7 : Depth discontinuities which orthographically project,
under general viewpoint, onto circles in line-drawings can be
described locally by spheres whose radii are identical to the
radii of the circles in the line-drawings.

Section III provides a formulation for the geometry of the
problem. The results derived within are used throughout the rest
of the paper. Sections IV, V and VI discuss straight-lines,
conic-sections, and circles, respectively. In section VII we
extend the results derived in the preceding sections. We conclude
with section VIII.

III.  Preliminaries

We begin by formulating the geometry of the problem before we get
down to the theorems and their proofs. The viewed surface is
assumed to be piecewise $C^3$. A piecewise $C^3$ surface is
defined to comprise finitely many $C^3$ patches, each bounded by
a finite piecewise $C^3$ curve.

Consider the projection geometry shown in Fig. 3. The projection
direction is along the z-axis. The figure also shows a curve
which results from the intersection of the viewed surface with a
plane parallel to the y-z plane. The depth discontinuity, say
$(x_i, y_i, z_i)$, which lies on this curve has its tangent
oriented along the direction of projection. If we perturb the
viewpoint by the counter-clockwise angle $A$ about the x-axis, it
is clear that the depth discontinuity before perturbation of
viewpoint may no longer remain a depth discontinuity. Fig. 4
shows the original depth discontinuity $(y_i, z_i)$ and the new
depth discontinuity $(y_i', z_i')$ which lies on the same
cross-sectional curve. Note that the new discontinuity is
expressed in the perturbed frame of reference which has its
z-axis oriented along the new projection direction.

Let the curvature of the cross-sectional curve at $(y_i, z_i)$ be
$\kappa_{xi}$, where the subscript $x$ refers to the line in the
x-y plane about which the viewpoint is perturbed, i.e., the
x-axis. We need to interpret $\kappa_{xi}$. As shown in Fig. 3,
the surface cross-section normal, in general, forms a non-zero
angle with the surface normal at the same point. Let us call this
angle $\phi_{xi}$. The subscript $x$ on $\phi_{xi}$ indicates that
it is a function of the axis about which the viewpoint is
perturbed. Then, we may write
$\kappa_{xi} = \kappa_i / \cos\phi_{xi}$ (see [Lipschutz, 1969])
where $\kappa_i$ is the normal curvature of the surface in the
projection direction at the point $(x_i, y_i, z_i)$. The radius
of curvature of the cross-sectional curve at the depth
discontinuity is $r_{xi} = 1/\kappa_{xi}$ and so we have the
equivalent relation $r_{xi} = r_i \cos\phi_{xi}$, where
$r_i = 1/\kappa_i$ is the radius of normal curvature in the
projection direction.
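The relation between the cross-sectional radius and the radius of normal curvature can be checked on a simple surface. The sketch below uses a sphere of radius R cut by a plane x = x0; the sphere, the function name and the numbers are illustrative assumptions chosen here, not an example taken from the paper.

```python
import math

def section_radius_and_angle(R, x0):
    """Sphere of radius R centred at the origin, cut by the plane x = x0.

    Returns (r_section, phi): the radius of the circular cross-section and
    the angle between the cross-section normal and the surface normal at
    the point of the section whose tangent points along the z-axis.
    """
    r_section = math.sqrt(R * R - x0 * x0)
    # Point on the section with tangent along z: (x0, r_section, 0).
    surface_normal = (x0 / R, r_section / R, 0.0)   # radial unit vector
    section_normal = (0.0, 1.0, 0.0)                # lies in the plane x = x0
    cos_phi = sum(a * b for a, b in zip(surface_normal, section_normal))
    return r_section, math.acos(cos_phi)

R, x0 = 2.0, 1.2
r_x, phi = section_radius_and_angle(R, x0)
# Every normal section of a sphere has radius R, so r_i = R here and the
# cross-sectional radius should equal R * cos(phi).
assert math.isclose(r_x, R * math.cos(phi))
```

On the sphere every normal curvature equals 1/R, which makes it a convenient case for seeing that only the tilt angle changes the cross-sectional radius.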

In case of a tangent discontinuity on the cross-sectional curve
at $(y_i, z_i)$, we may let $\kappa_{xi} \to \infty$ and all the
following analysis will remain valid. If the cross-sectional
curve has a curvature discontinuity at $(y_i, z_i)$, then we
assume $\kappa_{xi}$ to take on one of the two limiting values
depending on which side of the original depth discontinuity the
perturbed depth discontinuity lies³. This remark will be further
clarified as we proceed. It will be shown later that for
orthographic projections which are conic-sections under general
viewpoint, curvature discontinuities are disallowed at the depth
discontinuities. For straight-lines, on the other hand, it will
be shown that we must either have curvature discontinuities at
the depth discontinuities on every cross-sectional curve or at
none. However, even here the viewpoint will turn out to be
non-general in that the set of such viewpoints is closed and of
measure zero.

³ It is implicitly assumed here that the projection direction is
not tangential to the boundary of a $C^3$ surface patch at any
depth discontinuity. Observe that the set of orientations which
are tangential to any boundary is closed and of measure zero on
the Gaussian sphere. As the number of boundaries is finite, the
union of such sets is also closed and of measure zero. Hence, the
assumption is valid under general viewpoint.

It is reasonable to assume that $\kappa_{xi} \neq 0$. The
justification for this assumption will be presented below. Then,
ignoring higher order terms, we could describe the local
cross-sectional curve about $(y_i, z_i)$ as follows.

\[ (y - y_i) = \tfrac{1}{2}\,\kappa_{xi}\,(z - z_i)^2 \tag{1} \]

Perturbation of the reference frame causes the following
transformation of coordinates.

\[ x \to x', \qquad
   y \to y' \cos A - z' \sin A, \qquad
   z \to y' \sin A + z' \cos A \]

Equation (1) will then read

\[ (y' \cos A - z' \sin A - y_i)
   = \tfrac{1}{2}\,\kappa_{xi}\,(y' \sin A + z' \cos A - z_i)^2 . \tag{2} \]

The higher order terms may be tentatively neglected because their
contribution becomes relatively insignificant as the size of the
perturbation is reduced. However, it may turn out that we are
unable to satisfy our requirements by only constraining the local
tangents and curvatures. In that case we would have to include
the higher order terms in our analysis.

The new depth discontinuities under perturbation of viewpoint can
be found by equating $(dy'/dz')$ in equation (2) to zero. This
leads to equation (3), where $(y_i', z_i')$ are the coordinates of
the new depth discontinuity corresponding to the original depth
discontinuity with the same x-coordinate. Equation (4) is
obtained by substituting the equality (3) in equation (2).

\[ y_i' \sin A + z_i' \cos A = z_i - \frac{1}{\kappa_{xi}} \tan A \tag{3} \]

\[ y_i' \cos A - z_i' \sin A = y_i + \frac{1}{2\kappa_{xi}} \tan^2 A \tag{4} \]

Using (3) and (4), and making the substitution
$r_{xi} = 1/\kappa_{xi}$, we finally express the new depth
discontinuities as follows.

\[ \begin{aligned}
y_i' &= \left( y_i + z_i \tan A - \tfrac{1}{2}\, r_{xi} \tan^2 A \right) \cos A \\
z_i' &= \left( z_i - y_i \tan A - r_{xi} \tan A - \tfrac{1}{2}\, r_{xi} \tan^3 A \right) \cos A
\end{aligned} \tag{5} \]
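A small numerical check of equation (5) may help when following the later proofs. The sketch below assumes the local model of equation (1) in the form y - y_i = (1/2) kappa (z - z_i)^2, finds the point of that parabola whose tangent lies along the rotated viewing direction, and compares its rotated coordinates with the closed form. Function names and sample values are illustrative assumptions.

```python
import math

def perturbed_discontinuity(y_i, z_i, kappa, A):
    """Locate the depth discontinuity of y - y_i = 0.5*kappa*(z - z_i)**2
    after rotating the viewing frame by angle A about the x-axis."""
    # In the rotated frame y' = y cos A + z sin A; the new discontinuity is
    # where dy'/dz = kappa*(z - z_i)*cos A + sin A vanishes along the curve.
    z = z_i - math.tan(A) / kappa
    y = y_i + 0.5 * kappa * (z - z_i) ** 2
    y_new = y * math.cos(A) + z * math.sin(A)
    z_new = -y * math.sin(A) + z * math.cos(A)
    return y_new, z_new

def closed_form(y_i, z_i, kappa, A):
    """Right-hand sides of equation (5) with r = 1/kappa."""
    r, t, c = 1.0 / kappa, math.tan(A), math.cos(A)
    return ((y_i + z_i * t - 0.5 * r * t ** 2) * c,
            (z_i - y_i * t - r * t - 0.5 * r * t ** 3) * c)

direct = perturbed_discontinuity(0.7, -1.3, 2.5, 0.2)
formula = closed_form(0.7, -1.3, 2.5, 0.2)
assert all(math.isclose(d, f, abs_tol=1e-12) for d, f in zip(direct, formula))
```

The agreement is exact for the second-order model; for a real surface it holds only up to the higher order terms that the text neglects.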

Before proceeding to the next section, we present a justification
for the assumption that $\kappa_{xi} \neq 0$. The reader may wish
to postpone going through this argument till later. First, it can
be argued that $\kappa_{xi}$ is either zero all along the depth
discontinuity curve or it is not zero anywhere along it. The
reason for this is that, under perturbation of viewpoint, points
with non-zero $\kappa_{xi}$ move an order of magnitude slower than
points with zero $\kappa_{xi}$. Hence, if $\kappa_{xi}$ were not
uniformly zero or non-zero along the depth discontinuity curve,
then the projection would not remain a straight-line or
conic-section under perturbation of viewpoint. This argument
could be made more formal only after the presentation of the
following two sections. Next, we argue that $\kappa_{xi}$ cannot
be zero all along the depth discontinuity curve, for a general
viewpoint. If any of the cross-sectional curves has a tangent
discontinuity at the depth discontinuity, then $\kappa_{xi}$ will
be non-zero there. Let us assume that this is not so, i.e.,
$\kappa_{xi}$ is zero all along the depth discontinuity curve and
the surface-tangent is continuous along this curve. Then, no
depth discontinuity on any cross-sectional curve will remain a
depth discontinuity under perturbation of viewpoint. It follows
that, under general viewpoint, there must exist open intervals on
the depth discontinuity curve which lie within the interior of
$C^3$ patches. Now, every non-boundary depth discontinuity point
belongs to one of the following categories : elliptic, parabolic,
hyperbolic or planar (see [Lipschutz, 1969]). Because every
planar depth discontinuity must transit to a non-planar depth
discontinuity under perturbation of viewpoint⁴, we will always
have some open interval on the depth discontinuity curve which
lies within some open surface set consisting solely of elliptic,
parabolic, or hyperbolic points. In case one such elliptic patch
exists, $\kappa_i$, and therefore $\kappa_{xi}$, cannot be
identically zero along the depth discontinuity curve. In case of
a parabolic patch, each point has only one direction in which the
surface curvature, $\kappa_i$, is zero. Consider the
cross-sectional curve, of the type shown in Fig. 3, on which one
such point lies. Because our parabolic surface patch is $C^3$,
the asymptotic directions along the cross-sectional curve must
either map onto a point or a $C^1$ curve on the Gaussian sphere.
In either case, $\kappa_i$ along the cross-sectional curve can be
zero only for a closed set of measure zero on the Gaussian
sphere. Therefore, $\kappa_{xi}$ cannot be identically zero along
the depth discontinuity curve, under general viewpoint. An
identical argument holds for hyperbolic points, except that now
we consider two separate mappings, one for each of the asymptotic
directions. In this case, the desired result follows from the
observation that the union of two closed sets of measure zero is
also closed and of measure zero. This completes the argument that
for a piecewise $C^3$ surface and a general viewpoint,
$\kappa_{xi}$ is not zero at any point on the depth discontinuity
curve.

⁴ Planar points form closed sets on the surface. A great circle
on the Gaussian sphere describes the viewing directions for which
points in any one such set may lie on a depth discontinuity
curve. As every great circle on the Gaussian sphere is a closed
set of measure zero, the described transition must occur under
perturbation of the viewpoint.

IV.  Straight-Lines 

Theorem 1 : Straight-lines in a line-drawing obtained under
orthographic projection with a general viewpoint are projections
of straight-lines in space.

Proof : For reasons discussed in the introduction, it is
sufficient to verify the truth of the theorem for depth
discontinuities. Consider any three points, say
$(x_1, y_1, z_1)$, $(x_2, y_2, z_2)$ and $(x_3, y_3, z_3)$, which
project to a straight-line in a line-drawing. Collinearity of the
orthographic projection of the three points, under perturbation
of viewpoint, requires that

\[ \frac{y_1' - y_2'}{x_1 - x_2} = \frac{y_1' - y_3'}{x_1 - x_3} \tag{6} \]

Substitution, using (5), and simplification using the
collinearity of $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$, the
original orthographic projections of the three points, leads to
the following equality.

\[ \left[ \frac{z_1 - z_2}{x_1 - x_2} - \frac{z_1 - z_3}{x_1 - x_3} \right]
   \frac{2}{\tan A}
   = \frac{r_{x1} - r_{x2}}{x_1 - x_2} - \frac{r_{x1} - r_{x3}}{x_1 - x_3} \tag{7} \]

If the three points are collinear in space, then the
left-hand-side of the equation is identically zero and we have
the following constraint on the radii of curvature, $r_{xi}$, at
the points along the cross-sectional curves perpendicular to the
axis about which the viewpoint is perturbed, i.e., the x-axis
here.

\[ \frac{r_{x1} - r_{x2}}{r_{x1} - r_{x3}} = \frac{x_1 - x_2}{x_1 - x_3} \tag{8} \]

If the three points are not collinear, then the left-hand-side of
(7) is a function of the angle $A$ while the right-hand-side is
constant for any choice of points and curvatures. Hence, the
equation cannot be satisfied under such circumstances.

Q.E.D. 
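The role of constraint (8) in the proof can be seen numerically. The sketch below applies the y-component of equation (5) to three points on a straight line in space and evaluates the slope difference of equation (6). With radii that vary linearly in x the projection stays collinear for every perturbation angle tried; with radii that violate (8) it does not. The sample points, radii and function names are hypothetical.

```python
import math

def projected_y(y, z, r, A):
    """y-coordinate of the perturbed depth discontinuity, from equation (5)."""
    t = math.tan(A)
    return (y + z * t - 0.5 * r * t * t) * math.cos(A)

def collinearity_defect(points, radii, A):
    """Difference of the two slopes in equation (6); zero means collinear."""
    (x1, y1), (x2, y2), (x3, y3) = [
        (x, projected_y(y, z, r, A)) for (x, y, z), r in zip(points, radii)
    ]
    return (y1 - y2) / (x1 - x2) - (y1 - y3) / (x1 - x3)

# Three points on one straight line in space (hypothetical values).
line = [(0.0, 1.0, 2.0), (1.0, 1.5, 1.0), (3.0, 2.5, -1.0)]
linear_r = [0.4 + 0.3 * x for x, _, _ in line]      # satisfies equation (8)
other_r = [0.4, 0.9, 0.5]                           # violates equation (8)

for A in (0.05, 0.1, 0.2):
    assert abs(collinearity_defect(line, linear_r, A)) < 1e-12
    assert abs(collinearity_defect(line, other_r, A)) > 1e-6
```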

We now consider the constraints imposed upon the viewed surface
by the equality (8). We use the previously discussed relation
$r_{xi} = r_i \cos\phi_{xi}$ for this purpose. Although
$\phi_{xi}$, in general, depends on the point under
consideration, for a straight-line projection under general
viewpoint it is constant because of Theorem 1 and may be
expressed as $\phi_x$. Consequently, equation (8) can be
rewritten as follows.

\[ \frac{r_1 - r_2}{r_1 - r_3} = \frac{x_1 - x_2}{x_1 - x_3} \tag{9} \]

This implies that the radius of normal curvature, in the
direction of projection, along the straight surface depth
discontinuity varies linearly with the distance measured along
the line. Hence, it is either identically zero or it is zero at
no more than one isolated point. In the former case the edge is a
tangent discontinuity which is viewpoint-independent, while in
the latter case it is a viewpoint-dependent depth discontinuity.
Henceforth in this discussion, we will restrict ourselves to the
latter case.

From the linear variation of $r_i$ it should be clear that either
the curvatures of the cross-sectional curves are discontinuous at
every depth discontinuity or at none. We may assume, without any
loss of generality, that the curvatures are continuous at every
depth discontinuity because the set of viewpoints for which this
assumption is false is closed and of measure zero. Now we want to
deduce that the surface is locally $C^2$. The surface would not
be locally $C^2$ only if some depth discontinuity point is on the
boundary of a $C^3$ surface patch and it does not have a twice
differentiable neighborhood. For every such point, as previously
shown, the tangent to the boundary does not lie along the
projection direction, under general viewpoint. Then, if any such
point is to lie on the depth discontinuity curve, the
cross-sectional curve through it cannot be $C^2$. Hence, the
surface must be locally $C^2$.

It is not hard to see that the normal curvature of the surface in
the direction of the depth discontinuity curve is one of the
principal curvatures and is identically zero. The other principal
curvature has a fixed orientation, orthogonal to the
straight-line depth discontinuity curve, and its radius varies
linearly along it. As these surface properties are preserved
under perturbation of viewpoint, the surface must be locally
developable, i.e., cylindrical, conical, or tangential
developable (see [Hilbert and Cohn-Vossen, 1952]).
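A worked instance of the linear law, chosen here for illustration and not taken from the paper, is a right circular cone with half-angle $\theta$. Let $s$ denote distance from the apex along a ruling and $\omega$ the (constant) angle between the viewing direction and the principal direction orthogonal to the ruling; both symbols are introduced only for this sketch.

```latex
% Principal curvatures of a right circular cone along one ruling.
\kappa_{\text{ruling}}(s) = 0, \qquad
\kappa_{\perp}(s) = \frac{1}{s \tan\theta}, \qquad
r_{\perp}(s) = \frac{1}{\kappa_{\perp}(s)} = s \tan\theta .

% Euler's formula gives the normal curvature in the viewing direction,
% so the radius of normal curvature is again linear in s.
\kappa_i(s) = \kappa_{\perp}(s) \cos^2\omega
\quad\Longrightarrow\quad
r_i(s) = \frac{s \tan\theta}{\cos^2\omega} .
```

The radius vanishes only at the apex, which matches the statement that a non-identically-zero radius can be zero at no more than one isolated point. For a cylinder the same radius is constant, the degenerate case of the linear law.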

V.  Conic-Sections 

Theorem 2 : Ellipses, parabolas and hyperbolas in a line-drawing
obtained under orthographic projection with a general viewpoint
are projections of ellipses, parabolas and hyperbolas,
respectively, in space.

Proof : For reasons discussed in the introduction, it is
sufficient to verify the truth of the theorem for depth
discontinuities. We first show that space curves which project to
conics under general viewpoint must be planar. Straight-lines are
excluded from this discussion.

Consider any five points, say $(x_i, y_i, z_i)$, i = 1..5, which
project onto the conic. The orthographic projection direction, as
before, is the z-axis. The five chosen points uniquely determine
the conic, $ax^2 + bxy + cy^2 + dx + ey + f = 0$. The conic
coefficients are the solution to
$P\,[a\ b\ c\ d\ e\ f]^T = 0$, where

\[ P = \begin{bmatrix}
x_1^2 & x_1 y_1 & y_1^2 & x_1 & y_1 & 1 \\
x_2^2 & x_2 y_2 & y_2^2 & x_2 & y_2 & 1 \\
x_3^2 & x_3 y_3 & y_3^2 & x_3 & y_3 & 1 \\
x_4^2 & x_4 y_4 & y_4^2 & x_4 & y_4 & 1 \\
x_5^2 & x_5 y_5 & y_5^2 & x_5 & y_5 & 1
\end{bmatrix} \]

Now let us choose a sixth point, say $(x_6, y_6, z_6)$, which
also projects onto the conic determined above. Then, we may write

\[ [x_6^2\ \ x_6 y_6\ \ y_6^2\ \ x_6\ \ y_6\ \ 1]
   = [\eta_1\ \eta_2\ \eta_3\ \eta_4\ \eta_5]\, P \tag{10} \]

The $\eta_i$ are uniquely determined owing to the linear
independence of the rows of $P$. Using (10), we may express the
$\eta_i$ as follows.

\[ [\eta_1\ \eta_2\ \eta_3\ \eta_4\ \eta_5]
   = [x_6^2\ \ x_6 y_6\ \ y_6^2\ \ x_6\ \ y_6\ \ 1]\, P^T (P P^T)^{-1} \tag{11} \]
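Equations (10) and (11) can be exercised exactly on a small example. The sketch below takes five points on the parabola y = x^2, builds P, computes the coefficients eta from formula (11) with exact rational arithmetic, and confirms that they reproduce the row of a sixth point on the same conic. The choice of parabola, the sample abscissae and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

def conic_row(x, y):
    """Row [x^2, xy, y^2, x, y, 1] used to build P in the text."""
    return [x * x, x * y, y * y, x, y, Fraction(1)]

def solve(G, b):
    """Solve G u = b by Gauss-Jordan elimination over exact fractions."""
    n = len(G)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

# Five points on y = x^2 and a sixth point on the same conic.
ts = [Fraction(t) for t in (-2, -1, 0, 1, 2)]
P = [conic_row(t, t * t) for t in ts]
v6 = conic_row(Fraction(3), Fraction(9))

G = [[sum(a * b for a, b in zip(ri, rj)) for rj in P] for ri in P]   # P P^T
rhs = [sum(a * b for a, b in zip(ri, v6)) for ri in P]               # P v6^T
eta = solve(G, rhs)                                                  # formula (11)

# Equation (10): the eta-weighted combination of the rows of P is the sixth row.
recon = [sum(eta[i] * P[i][k] for i in range(5)) for k in range(6)]
assert recon == v6
```

Exact fractions are used so that the equality in (10) can be tested without a tolerance; a floating-point linear solver would serve equally well in practice.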

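Now, we perturb the viewpoint about the x-axis by the
counter-clockwise angle $A$, and once again use the expressions
(5) for the new depth discontinuities. For small $A$, we may
substitute $\tan A$ by $A$ and $\cos A$ by unity. Then, the vector
$[x_i^2\ \ x_i y_i\ \ y_i^2\ \ x_i\ \ y_i\ \ 1]$ transforms to

\[ [x_i^2\ \ \ x_i (y_i + z_i A - \tfrac{1}{2} r_{xi} A^2)\ \ \ (\,\cdot\,)^2\ \ \ x_i\ \ \ (\,\cdot\,)\ \ \ 1], \]

where $(\,\cdot\,)$ abbreviates
$y_i + z_i A - \tfrac{1}{2} r_{xi} A^2$. $P$ is correspondingly
transformed to $P'$. For the projected curve to remain a conic,
it is necessary that

\[ [x_6^2\ \ \ x_6 (y_6 + z_6 A - \tfrac{1}{2} r_{x6} A^2)\ \ \ (\,\cdot\,)^2\ \ \ x_6\ \ \ (\,\cdot\,)\ \ \ 1]
   = [\lambda_1\ \lambda_2\ \lambda_3\ \lambda_4\ \lambda_5]\, P' \tag{12} \]

for some $\lambda_i$. Ignoring higher order terms, we may
approximate $\lambda_i$ by $(\eta_i + p_i A + q_i A^2)$. Then,
equating the coefficients of the $A$ terms, we get the following
equation.

\[ [\eta_1\ \eta_2\ \eta_3\ \eta_4\ \eta_5]\, Q + [p_1\ p_2\ p_3\ p_4\ p_5]\, P
   = [0\ \ x_6 z_6\ \ 2 y_6 z_6\ \ 0\ \ z_6\ \ 0] \tag{13} \]

\[ \text{where}\quad Q = \begin{bmatrix}
0 & x_1 z_1 & 2 y_1 z_1 & 0 & z_1 & 0 \\
0 & x_2 z_2 & 2 y_2 z_2 & 0 & z_2 & 0 \\
0 & x_3 z_3 & 2 y_3 z_3 & 0 & z_3 & 0 \\
0 & x_4 z_4 & 2 y_4 z_4 & 0 & z_4 & 0 \\
0 & x_5 z_5 & 2 y_5 z_5 & 0 & z_5 & 0
\end{bmatrix} \]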
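The rows of Q are the first-order coefficients of the perturbed rows of P. A quick finite-perturbation check of that expansion, under the small-angle form y' = y + zA - (1/2) r A^2 used in the text, is sketched below; the sample point, radius and function names are hypothetical.

```python
def conic_row(x, y):
    """Row [x^2, xy, y^2, x, y, 1] of P."""
    return [x * x, x * y, y * y, x, y, 1.0]

def perturbed_row(x, y, z, r, A):
    """Row built from the small-angle form y' = y + z*A - 0.5*r*A**2."""
    return conic_row(x, y + z * A - 0.5 * r * A * A)

def q_row(x, y, z):
    """Row of Q: coefficient of A in the perturbed row."""
    return [0.0, x * z, 2.0 * y * z, 0.0, z, 0.0]

x, y, z, r = 1.5, -0.8, 2.0, 0.6     # hypothetical point and radius
for A in (1e-2, 1e-3):
    base = conic_row(x, y)
    pert = perturbed_row(x, y, z, r, A)
    lin = q_row(x, y, z)
    residual = max(abs(p - b - A * q) for p, b, q in zip(pert, base, lin))
    assert residual < 10.0 * A * A     # remainder is second order in A
```

The residual shrinks by about a factor of one hundred when A shrinks by ten, which is what a second-order remainder should do.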

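We may rewrite (13), using (11), as follows.

\[ [x_6^2\ \ x_6 y_6\ \ y_6^2\ \ x_6\ \ y_6\ \ 1]\, P^T (P P^T)^{-1} Q
   - [0\ \ x_6 z_6\ \ 2 y_6 z_6\ \ 0\ \ z_6\ \ 0]
   + [p_1\ p_2\ p_3\ p_4\ p_5]\, P = 0 \tag{14} \]

Now, notice that the first two row vectors have zeros for the
first, fourth and sixth elements. Hence, we may rewrite (14) as

\[ [x_6^2\ \ x_6 y_6\ \ y_6^2\ \ x_6\ \ y_6\ \ 1]\, P^T (P P^T)^{-1} Q'
   - [x_6\ \ 2 y_6\ \ 1]\, z_6 + [p_1\ p_2]\, R = 0 \tag{15} \]

\[ \text{where}\quad Q' = \begin{bmatrix}
x_1 z_1 & 2 y_1 z_1 & z_1 \\
x_2 z_2 & 2 y_2 z_2 & z_2 \\
x_3 z_3 & 2 y_3 z_3 & z_3 \\
x_4 z_4 & 2 y_4 z_4 & z_4 \\
x_5 z_5 & 2 y_5 z_5 & z_5
\end{bmatrix} \]

and $R$ depends on $(x_i, y_i)$, i = 1..5.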

Because $P$ was a full rank matrix, the two rows of $R$ are
independent. It follows from this observation that the
combination of the first two terms in equation (15) must lie in
the plane defined by the two rows of $R$, irrespective of our
choice of $(x_6, y_6, z_6)$, as long as this point projects onto
the conic, i.e., $x_6$ and $y_6$ satisfy (10). Now, for a curve
traced by a row vector, say $v(t)$, to be planar its torsion must
be identically zero, i.e.,
$\det([v'(t)\ v''(t)\ v'''(t)]^T)$ must equal zero (see
[Lipschutz, 1969]). Hence, it is necessary that the combination
of the first two terms in (15) satisfy this requirement, under
variation of $(x_6, y_6, z_6) = (x(t), y(t), z(t))$. Note that the
original coordinate frame could always be chosen so that the
ellipses, parabolas, and hyperbolas under consideration are
parameterized as $x(t) = m \cos t$ and $y(t) = M \sin t$,
$x(t) = m t$ and $y(t) = M t^2$, and $x(t) = m \cosh t$ and
$y(t) = M \sinh t$, respectively. Substitution of the parametric
forms of $x_6$, $y_6$, and $z_6$ into the zero torsion equation
will lead to a third order ordinary differential equation in
$z(t)$. This differential equation will be satisfied by
$z(t) = a\,x(t) + b\,y(t) + c$ because (15) is just a restatement
of (13), and it can easily be verified that (13) is satisfied by
this expression for $z(t)$ if we make all the $p_i$ zero. As $a$,
$b$ and $c$, and therefore $z(t)$, are uniquely determined by
$z(0)$, $z'(0)$, and $z''(0)$, it follows from the theory of
ordinary differential equations that the planar expression for
$z(t)$ is the unique solution on any interval containing $t = 0$
(see [Coddington, 1961]). To be thorough, one must also check
that the Lipschitz condition is satisfied by the differential
equation. This, however, is rarely a problem.
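The zero-torsion criterion used above can be illustrated on a space curve directly. The sketch below applies det[v', v'', v'''] to v(t) = (m cos t, M sin t, z(t)), that is, to the space curve itself rather than to the specific vector combination appearing in (15). It confirms that a planar depth function z = ax + by + c gives a vanishing determinant while a non-planar one does not. The ellipse parameters, the plane and the non-planar depth z = cos 2t are hypothetical choices.

```python
import math

def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def torsion_det(t, m, M, z_derivs):
    """det[v', v'', v'''] for v(t) = (m cos t, M sin t, z(t)).

    z_derivs(t) must return the tuple (z', z'', z''') evaluated at t.
    """
    dx = (-m * math.sin(t), -m * math.cos(t), m * math.sin(t))
    dy = (M * math.cos(t), -M * math.sin(t), -M * math.cos(t))
    dz = z_derivs(t)
    rows = [(dx[k], dy[k], dz[k]) for k in range(3)]
    return det3(*rows)

m, M, a, b = 2.0, 1.0, 0.3, -0.7      # hypothetical ellipse and plane

def planar(t):       # derivatives of z = a*x + b*y + c
    return (-a * m * math.sin(t) + b * M * math.cos(t),
            -a * m * math.cos(t) - b * M * math.sin(t),
            a * m * math.sin(t) - b * M * math.cos(t))

def twisted(t):      # derivatives of z = cos(2t), not planar over the ellipse
    return (-2.0 * math.sin(2 * t), -4.0 * math.cos(2 * t), 8.0 * math.sin(2 * t))

for t in (0.1, 0.9, 2.3):
    assert abs(torsion_det(t, m, M, planar)) < 1e-12
    assert abs(torsion_det(t, m, M, twisted)) > 1e-3
```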

We have shown above that the space curves corresponding to conic
orthographic projections under general viewpoint are restricted
to be planar. If the orthographic projection of a planar curve is
quadratic, then the original curve must also be a quadratic. In
fact, the sign of the discriminant $(b^2 - 4ac)$ of the
quadratic, $ax^2 + bxy + cy^2 + dx + ey + f = 0$, is unchanged
under orthographic projection. These claims can easily be
verified by noting that any orthographic projection of a planar
curve may be computed, up to a rotation, by compressing the curve
along some direction in the plane of the curve. Hence, we have
proven that ellipses, parabolas, and hyperbolas in a line-drawing
obtained under orthographic projection with a general viewpoint
are projections of space curves which themselves are ellipses,
parabolas, and hyperbolas, respectively.

Q.E.D.
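The invariance of the sign of the discriminant under compression is easy to confirm. In the sketch below the compression is taken along the y-axis with factor s, which loses no generality because an in-plane rotation leaves b^2 - 4ac unchanged. The function names and sample conics are illustrative assumptions.

```python
def compress(conic, s):
    """Coefficients of the image of a conic under (x, y) -> (x, s*y), s > 0.

    conic = (a, b, c, d, e, f) for a x^2 + b x y + c y^2 + d x + e y + f = 0.
    Substituting y = y_new / s gives the coefficients of the image curve.
    """
    a, b, c, d, e, f = conic
    return (a, b / s, c / (s * s), d, e / s, f)

def discriminant(conic):
    """b^2 - 4ac: negative for ellipses, zero for parabolas, positive for hyperbolas."""
    a, b, c = conic[:3]
    return b * b - 4.0 * a * c

examples = {
    "ellipse": (1.0, 0.5, 2.0, 0.0, 0.0, -1.0),
    "parabola": (1.0, 2.0, 1.0, 0.0, -1.0, 0.0),
    "hyperbola": (1.0, 3.0, 1.0, 0.0, 0.0, -1.0),
}
for name, conic in examples.items():
    for s in (0.9, 0.5, 0.1):
        before, after = discriminant(conic), discriminant(compress(conic, s))
        # The discriminant is scaled by 1/s**2, so its sign cannot change.
        assert abs(after - before / (s * s)) < 1e-9
```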

As we did for straight lines previously, we now consider the
constraints other than planarity which can be imposed on the
viewed surface in the vicinity of the depth discontinuities. Once
again consider the substitution of
$(\eta_i + p_i A + q_i A^2)$ for $\lambda_i$ in equation (12).
Also note that we have shown above that all the $p_i$ are zero
and that $z_i = a x_i + b y_i + c$. Using this information and
equating the coefficients of the $A^2$ terms, we get the
following equation.

\[ [\eta_1\ \eta_2\ \eta_3\ \eta_4\ \eta_5]\, S + [q_1\ q_2\ q_3\ q_4\ q_5]\, P
   = [0\ \ -\tfrac{1}{2} x_6 r_{x6}\ \ -y_6 r_{x6}\ \ 0\ \ -\tfrac{1}{2} r_{x6}\ \ 0] \tag{16} \]

But this is precisely of the same form as equation (13) with
$q_i$ replacing $p_i$ and $-\tfrac{1}{2} r_{xi}$ replacing $z_i$.
Hence, by the same arguments as above, we may conclude that the
radius of curvature, $r_{xi}$, at the depth discontinuity
$(x_i, y_i, z_i)$, along the cross-sectional curve formed by the
intersection of the viewed surface with a plane parallel to the
y-z plane, can be expressed as
$(\alpha x_i + \beta y_i + \gamma)$. Employing the relation
$r_{xi} = r_i \cos\phi_{xi}$, we get
$r_i \cos\phi_{xi} = (\alpha x_i + \beta y_i + \gamma)$. First,
note that $\phi_{xi}$ can be expressed as $(\phi_i + \psi_x)$,
where $\phi_i$ is determined by the tangent to the conic-section
in space and $\psi_x$ depends on the axis about which the
viewpoint was perturbed, in this case the x-axis. Next, observe
that $\alpha$, $\beta$ and $\gamma$ must depend on $\psi_x$.
Hence, we rewrite the expression as

\[ r_i = \frac{\alpha(\psi_x)\, x_i + \beta(\psi_x)\, y_i + \gamma(\psi_x)}
              {\cos(\phi_i) \cos(\psi_x) - \sin(\phi_i) \sin(\psi_x)} \tag{17} \]

In the discussion so far, we have considered the perturbation to
be about the x-axis. However, we were free to perturb the
viewpoint about any axis in the x-y plane. Because $r_i$ is a
surface property independent of the direction of perturbation, we
require that the right-hand-side of (17) be independent of
$\psi_x$.

If the numerator in (17) is not identically zero, then it can be
zero at no more than two points along the depth discontinuity
curve. Hence, either $r_i$ is identically zero or it is zero at
no more than two isolated points. We conclude that no edge which
projects to a conic-section under general orthographic projection
can be a combination of viewpoint-independent and
viewpoint-dependent edges. Henceforth, in this discussion, we
will restrict our attention to viewpoint-dependent depth
discontinuities.

We had previously promised a justification for the absence of
curvature discontinuities at depth discontinuities which project
to conic-sections. The argument runs as follows. We choose four
depth discontinuities and perturb three of them keeping the
fourth one fixed. This can always be done by choosing the
projection of the surface normal at the fourth point to be the
axis of perturbation. Now note that $r_{xi}$ at the fourth will
be well defined by the three other $r_{xi}$ because the three
depth discontinuities must be non-collinear. But, the fourth
point could have been chosen at any depth discontinuity. It
follows that the normal curvature in the direction of projection
is well defined at every depth discontinuity which projects onto
a conic-section. By the same arguments as those presented for the
straight-line case, we deduce that the surface is locally $C^2$.

If the orientation of the plane of the conic-section in space
were known, we could obtain expressions for $x_i$ and $y_i$ in
terms of the angle $\phi_i$ formed by the tangent to this curve
with a fixed axis. Then, we must be able to choose $\alpha$,
$\beta$ and $\gamma$ such that the right-hand-side of (17) is
independent of $\psi_x$. This choice, if possible, would
determine $r_i$ up to a multiplicative factor. Thus, the normal
curvature of the surface along the depth discontinuity curve, in
the direction of projection, has three degrees of freedom, two
for the orientation of the plane of the conic-section in space
and one for the multiplicative factor. Given the orientation of
the plane of the space conic, the curvature of the surface along
the depth discontinuity curve, in the direction of the tangent
along the conic, can also be found. It may seem, at first, that
because we only know the normal curvatures of the surface in two
directions at each depth discontinuity, we are still one
constraint short of being able to determine the principal
curvatures and their orientations. But, the normal curvature in a
third direction at any point may be determined by using the fact
that the surface is $C^2$ and that the depth of a point along the
third direction can be expressed as a function of the normal
curvature in the projection direction at an adjacent point.
(Higher order terms are disregarded as before.) Thus, the first
and second fundamental forms (see [Lipschutz, 1969]) of the
surface are completely determined, up to three degrees of
freedom, along the depth discontinuity curve.

The reader may wonder if there do exist
continuous-surface-tangent surfaces which project onto ellipses,
parabolas, and hyperbolas, under general viewpoint. There do. The
simplest examples are members of the quadric family. We show this
in the Appendix. In fact, if we arbitrarily fix the depth of the
quadric from the projection plane, then the quadric which
orthographically projects onto a given quadratic is completely
determined up to the three degrees of freedom specified above.
This too is shown in the Appendix. The fact that quadrics project
onto quadratics implies that the local second order descriptions
of surfaces which project onto ellipses are either ellipsoids or
hyperboloids of one sheet; of those which project to parabolas,
either elliptic paraboloids or hyperbolic paraboloids; and of
those which project to hyperbolas, either hyperboloids of one
sheet or hyperboloids of two sheets. The classification of
quadrics can be found in [Olmsted, 1947].

VI.  Circles 

Theorem  3  :  Circles  in  a  line-drawing  obtained 
under  orthographic  projection  with  a  general 
viewpoint  are  projections  of  circles  in  space  which 
are  confined  to  lie  in  planes  parallel  to  the  image 
plane. 

Proof  :  We  proceed  in  the  same  fashion  as  in  the 
proof  of  the  previous  theorem  except  that  the 
column with the $x_i y_i$ terms in P is now absent and
so  we  need  only  consider  five  points  instead  of  six. 
We  finally  arrive  at  the  following  requirement 
instead  of  equation  (15). 

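$$
\begin{bmatrix} x_5^2 & y_5^2 & x_5 & y_5 & 1 \end{bmatrix}
S^T (S S^T)^{-1} Z'
- \begin{bmatrix} 2y_5 & 1 \end{bmatrix} z_5
+ [\rho_1]\, U = 0 \tag{18}
$$

where

$$
S = \begin{bmatrix}
x_1^2 & y_1^2 & x_1 & y_1 & 1 \\
x_2^2 & y_2^2 & x_2 & y_2 & 1 \\
x_3^2 & y_3^2 & x_3 & y_3 & 1 \\
x_4^2 & y_4^2 & x_4 & y_4 & 1
\end{bmatrix}, \qquad
Z' = \begin{bmatrix}
2y_1 z_1 & z_1 \\
2y_2 z_2 & z_2 \\
2y_3 z_3 & z_3 \\
2y_4 z_4 & z_4
\end{bmatrix},
$$

and $U$ depends on $(x_i, y_i)$, $i = 1, \ldots, 4$.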

Going through a similar line of reasoning as before,
we may conclude that the depth coordinates $z_j$ are
of the form $(b y_j + c)$. But, if our perturbation was
about the y-axis instead of about the x-axis, we
would instead have concluded that the $z_j$ are of
the form $(a x_j + c)$. Because the expressions for $z_j$
must be the same in both cases, it follows that the
depth $z_j$ of the points is constant. Finally, note
that curves confined to planes perpendicular to the
projection direction orthographically project onto
identical curves. This completes the proof.

Q.E.D. 
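The role of the constant-depth condition can be illustrated numerically. The following Python sketch is an illustration added here, not part of the original proof. It assumes projection along the z-axis and a circle whose plane is tilted about the x-axis; the function names `project_tilted_circle` and `radial_spread` are hypothetical. When the tilt is zero the image is a circle; otherwise the image is squeezed by the cosine of the tilt in the y direction and is no longer a circle.

```python
import math

def project_tilted_circle(radius, tilt, n=360):
    """Orthographic image (projection along z) of a circle centered at the
    origin whose plane is the x-y plane rotated by `tilt` radians about x."""
    pts = []
    for k in range(n):
        t = 2 * math.pi * k / n
        # Space point: (r cos t, r sin t cos(tilt), r sin t sin(tilt)).
        # The projection simply drops the z coordinate.
        pts.append((radius * math.cos(t),
                    radius * math.sin(t) * math.cos(tilt)))
    return pts

def radial_spread(pts):
    """Difference between the largest and smallest distance from the origin.
    It is zero (up to rounding) exactly when the points lie on a circle
    centered at the origin."""
    d = [math.hypot(x, y) for x, y in pts]
    return max(d) - min(d)

print(radial_spread(project_tilted_circle(1.0, 0.0)))  # parallel to image plane
print(radial_spread(project_tilted_circle(1.0, 0.3)))  # tilted plane
```

For a unit circle the second value equals 1 - cos(tilt), the gap between the semi-axes of the projected ellipse.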


Corollary  :  Circles  in  a  line-drawing  obtained 
under  orthographic  projection  with  a  general 
viewpoint are projections of continuous-surface-tangent
depth discontinuities in space.


As  we  did  for  general  conics  above,  we  now 
constrain  the  surface  curvatures  along  the  depth 
discontinuity  curve,  in  this  case  a  circle.  We  once 
again consider the expression (17) for $r_i$. If the
radius of the circle is $r$, then without any loss of
generality we can substitute $x_i = x_0 + r\cos\phi_i$
and $y_i = y_0 + r\sin\phi_i$. This leads to the result
that $r_i = mr$, where $m$ is an arbitrary multiplicative
factor. The conclusion that $r_i$ is constant
could also be drawn purely from symmetry
considerations.


Now, we point out that the satisfaction of a
constraint of the type (12) is not sufficient to ensure
that the mapping under perturbation of viewpoint
will remain a circle. We also require that the
coefficients of the $x^2$ and $y^2$ terms in the quadratic
remain equal. This requirement, after some algebra,
can be shown to be unsatisfiable under the
second order approximation (1). However, it can be
verified using the constraints derived above that
the multiplicative factor $m$, in the expression for $r_i$,
must be unity if the circular shape is to be maintained
under perturbation of viewpoint. Thus, we
are led to the conclusion that every point on the
space curve which projects to a circle is an umbilical
point (see [Lipschutz, 1969]) with radius of curvature
equal to the radius of the projected circle.
Hence,  the  local  second  order  description  of  the  sur¬ 
face  is  a  sphere  with  the  same  radius  as  that  of  the 
projected  circle.  Notice  that  the  first  and  second 
fundamental  forms  of  the  surface  are  completely 
determined,  without  any  degree  of  freedom,  along 
the  depth  discontinuity  curve. 

VII.  Four  More  Theorems 

Before  we  state  the  theorems,  we  point  out 
that  the  analyses  and  results  presented  throughout 
this  paper  are  valid  even  if  the  straight-lines  and 
conic-sections  in  the  line-drawing  are  fragmented. 

Theorem 4 : Scene events which orthographically
project, under general viewpoint, onto straight-lines
or conic-sections in line-drawings must either be
completely viewpoint-independent or completely
viewpoint-dependent.

Theorem 5 : Continuous-surface-tangent depth
discontinuities which orthographically project, under
general viewpoint, onto straight-lines in line-drawings
can be described locally by developable surfaces.

Theorem 6 : Continuous-surface-tangent depth
discontinuities which orthographically project, under
general viewpoint, onto conic-sections in line-drawings
can be described locally by quadric surfaces.
Specifically, continuous-surface-tangent
depth discontinuities projecting onto ellipses can be
described by ellipsoids or hyperboloids of one sheet,
those projecting onto parabolas by elliptic paraboloids
or hyperbolic paraboloids, and those projecting onto
hyperbolas by hyperboloids of one sheet or
hyperboloids of two sheets. Furthermore, these quadric
surfaces are completely determined, up to three
degrees of freedom, by the conic-sections in the
line-drawings.

Theorem 7 : Depth discontinuities which orthographically
project, under general viewpoint, onto
circles in line-drawings can be described locally by
spheres whose radii are identical to the radii of the
circles in the line-drawings.

In  Theorem  4,  viewpoint-independent  scene 
events  refer  to  combinations  of  surface-tangent 
discontinuities,  surface-reflectance  discontinuities 
and  illumination  discontinuities,  while  viewpoint- 
dependent  scene  events  refer  solely  to  continuous- 
surface-tangent  depth  discontinuities.  As  men¬ 
tioned  previously,  pairs  of  straight-lines  are  con¬ 
sidered  to  be  instances  of  straight-lines  rather  than 
instances  of  degenerate  conic-sections. 

Theorem  4  immediately  follows  from  the  dis¬ 
cussion  so  far.  Theorem  5  was  proved  in  the  sec¬ 
tion  on  straight-lines.  Theorem  7  follows  from 
Theorem  6  and  the  discussion  in  the  section  on  cir¬ 
cles.  We  now  prove  Theorem  6. 

Proof of Theorem 6 : In section V, it was shown
that continuous-surface-tangent depth discontinuities
which orthographically project, under general
viewpoint, onto conic-sections in line-drawings can
be described up to second order properties by local
quadric surfaces. We now seek to show that the
quadric description is actually valid for properties
of all orders, not just the second order. We use the
following result from a paper by Maschke
[Maschke, 1902] : The developable surfaces and the
surfaces of the second order are then the only ones
for every point of which a surface of the second
order exists having a contact of the third order. It
follows from this result that all we need to establish
is third order contact between the projected surface
and a quadric at every point local to the depth
discontinuity curve on the surface.

Under the general viewpoint assumption, only
isolated points on the depth discontinuity curves
may be boundary points of a $C^3$ patch. Let us consider
a depth discontinuity point in the interior of a
$C^3$ patch. With an appropriate choice of the coordinate
frame, the surface may be locally expressed
by the following Taylor series expansion about
the origin. The coordinate axes are u, v, and w,
and the surface is tangent to the u-v plane at the
origin. L, M, and N are the coefficients of the
second fundamental form of the surface (see
[Lipschutz, 1969]).


$$
w = \frac{1}{2}\left(Lu^2 + 2Muv + Nv^2\right)
+ \frac{1}{6}\left(Pu^3 + 3Qu^2v + 3Ruv^2 + Sv^3\right)
+ o\!\left([u^2 + v^2]^{3/2}\right) \tag{19}
$$

Let the projection direction lie in the u-v plane and
make a counter-clockwise angle $\theta$ with the u-axis.
Then, along the depth discontinuity curve, the following
expression must be zero.

$$
-\sin\theta\left[(Lu + Mv) + \tfrac{1}{2}\left(Pu^2 + 2Quv + Rv^2\right) + \cdots\right]
+ \cos\theta\left[(Mu + Nv) + \tfrac{1}{2}\left(Qu^2 + 2Ruv + Sv^2\right) + \cdots\right]
= 0 \tag{20}
$$


This  equation  implicitly  expresses  v  in  terms  of  u. 
Now  we  use  the  fact  that  the  depth  discontinuity 
curve  must  be  planar  in  space.  Equivalently,  the 
torsion of the space curve $p = [u\ v\ w]$ must be
zero, i.e., $\det([p'\ p''\ p''']^T)$ must equal zero (see
[Lipschutz, 1969]). If we parameterize the curve in
terms of u, then we will get

$$
v''w''' - v'''w'' = 0. \tag{21}
$$
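For reference, with the curve written as $p(u) = [u,\ v(u),\ w(u)]$ and primes denoting differentiation with respect to u, the torsion condition expands as below. This is a routine worked step added for clarity; it uses only the parameterization stated in the text.

```latex
\det\begin{bmatrix} p' \\ p'' \\ p''' \end{bmatrix}
= \det\begin{bmatrix}
1 & v'   & w'   \\
0 & v''  & w''  \\
0 & v''' & w'''
\end{bmatrix}
= v''w''' - v'''w'' .
```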

We  require  that  this  equation  be  satisfied  at  the  ori¬ 
gin.  If  we  differentiate  equations  (20)  and  (19),  we 
get  the  following  equalities  at  the  origin. 

$$
v' = \frac{L\sin\theta - M\cos\theta}{N\cos\theta - M\sin\theta}
$$

$$
v'' = \frac{[P + 2Qv' + Rv'v']\sin\theta}{N\cos\theta - M\sin\theta}
- \frac{[Q + 2Rv' + Sv'v']\cos\theta}{N\cos\theta - M\sin\theta}
$$

$$
v''' = \frac{3[Q + Rv']\sin\theta - 3[R + Sv']\cos\theta}{N\cos\theta - M\sin\theta}\; v''
$$

$$
w'' = L + 2Mv' + Nv'v'
$$

$$
w''' = 3(M + Nv')v'' + (P + 3Qv' + 3Rv'v' + Sv'v'v')
$$

It follows from these equations that $v''$ or
$(w''' - w''v'''/v'')$, both of which are linear combinations
of P, Q, R, and S, must be zero.
Equivalently, if we put $A = P + Qv'$,
$B = Q + Rv'$, and $C = R + Sv'$, then at least
one of the following two equations must be true.

$$
(A + Bv')\sin\theta - (B + Cv')\cos\theta = 0 \tag{22}
$$

$$
3(M + Nv')\left[(A + Bv')\sin\theta - (B + Cv')\cos\theta\right]
+ (N\cos\theta - M\sin\theta)\left[A + 2Bv' + Cv'v'\right]
- 3w''\left[B\sin\theta - C\cos\theta\right] = 0 \tag{23}
$$
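The step from the derivatives at the origin to (22) and (23) can be spelled out as follows. This is a worked restatement added for clarity, under the definitions of A, B, and C given above and with the fourth order terms of the expansion disregarded; it introduces no new assumptions.

```latex
\begin{aligned}
P + 2Qv' + Rv'^2 &= A + Bv', \\
Q + 2Rv' + Sv'^2 &= B + Cv', \\
P + 3Qv' + 3Rv'^2 + Sv'^3 &= A + 2Bv' + Cv'^2, \\[4pt]
v'' &= \frac{(A + Bv')\sin\theta - (B + Cv')\cos\theta}
            {N\cos\theta - M\sin\theta}, \\
v''' &= \frac{3\,(B\sin\theta - C\cos\theta)}
             {N\cos\theta - M\sin\theta}\; v'', \\
w''' &= 3(M + Nv')\,v'' + A + 2Bv' + Cv'^2 .
\end{aligned}
```

Because $v'''$ is proportional to $v''$, equation (21) factors as $v''\,(w''' - w''v'''/v'') = 0$. Setting the first factor to zero gives (22). Multiplying the second factor by $(N\cos\theta - M\sin\theta)$ and substituting the expressions above gives (23).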

Notice  that  both  the  equations  are  homogeneous 
linear  equations.  Assume  that  L,  M,  and  N  are 
known at the origin. Now, if perturbation of $\theta$
leads  to  only  one  independent  equation,  then  P,  Q, 
R,  and  S  have  three  degrees  of  freedom,  i.e.,  we  can 
choose  three  of  these  coefficients  arbitrarily.  This 
can  only  happen  if  the  depth  discontinuity  curve  is 
locally  fixed  in  space  under  perturbation  of 
viewpoint. Then, $v'$ must be constant and the surface
must therefore be parabolic at the origin, i.e.,
$LN - M^2 = 0$. In fact, the surface must be locally
developable.  As  projections  of  such  depth  discon¬ 
tinuities  will  have  zero  curvature  in  the  line¬ 
drawing,  we  need  not  consider  them  here.  Hence, 
the  number  of  linearly  independent  equations  must 
either  be  two  or  three.  If  it  is  two,  then  P,  Q,  R, 
and  S  have  two  degrees  of  freedom,  and  if  it  is 
three,  then  there  is  only  one  degree  of  freedom.  It 
can  easily  be  verified  by  differentiation  that  the 
third  derivatives  of  a  quadric  surface  with  known 
L,  M,  and  N,  and  constrained  to  be  tangential  to 
the  u-v  plane  at  the  origin,  have  two  degrees  of 
freedom.  Further,  we  also  know  that  the  con¬ 
straints  between  P,  Q,  R,  and  S  are  satisfied  by 
every  quadric  surface  because,  as  shown  in  the 
Appendix,  the  depth  discontinuity  curve  on  a  qua¬ 
dric  is  always  planar.  Hence  we  can  always  find  a 
quadric  surface  which  has  third  order  contact  with 
the  projected  surface  at  the  origin. 

We have established third order contact
between a quadric surface and the projected surface
at the depth discontinuity under consideration. By
moving our depth discontinuity to any interior
point of a $C^3$ surface patch, we can establish third
order contact with a quadric surface at every such
point. From Maschke’s result it follows that the
surface can be locally described by quadric patches,
one for each $C^3$ patch along the depth discontinuity
curve. But we have previously established
that a single quadric surface has second order contact
along the whole depth discontinuity curve.
Further, it is shown in the Appendix that this
quadric is completely determined up to three
degrees of freedom. The desired result immediately
follows.

Q.E.D. 

There is a result by Blaschke [Blaschke, 1923]
that the only surfaces which yield planar depth
discontinuity curves under orthographic projection
for all viewpoints are the quadrics and the developable
surfaces. His proof implicitly assumes the
surface to be $C^4$. In our proof of Theorem 6, we
have essentially generalized Blaschke’s result to the
following. All surfaces which yield planar depth
discontinuity curves under orthographic projection
for all viewpoints within an open set on the Gaussian
sphere can be described locally by the quadrics
and the developable surfaces. Our proof assumes a
$C^3$ surface.

VIII. Conclusion

Line-drawing  interpretation  refers  to  the 
attempt to infer 3-D shape information about the
scene  from  its  line-drawing.  We  view  the  problem 
to  be  one  of  surface  constraint  generation  under  the 
assumption  of  a  general  viewpoint.  As  mentioned 
previously,  the  general  viewpoint  assumption  is 
essential  to  make  any  headway.  Without  this 
assumption  it  is  not  possible  to  make  non-trivial 
deductions.  In  this  work  we  attempted  to  discover 
constraints  associated  with  straight-lines  and 
conic-sections  in  line-drawings  obtained  under 
orthographic  projection.  Necessary  and  sufficient 
conditions  were  derived  in  each  case. 

Related previous work of interest includes
that of Whitney [Whitney, 1955] and Koenderink
[Koenderink, 1984]. It follows from Whitney’s
paper that there are only two types of generic
singularities in the projection of a $C^3$ surface onto
a plane. They are what he calls “folds” and
“cusps.” “Folds” are depth discontinuity curves
and “cusps” are their terminations. Whitney’s
results are valid for both perspective and orthographic
projections. However, his genericity condition
is more relaxed than our general viewpoint
condition in that it demands stability under perturbation
of both the viewpoint and the projected
surface. Koenderink has shown that a depth
discontinuity is an elliptic, hyperbolic, or parabolic
point on the viewed surface depending on whether
its projected image in the line-drawing is convex,
concave or an inflection, respectively. His work
makes the same assumptions as Whitney’s.

Finally,  we  note  that  we  have  made  no  effort 
whatsoever  to  suggest  schemes  for  the  detection  of 
straight-lines  and  conic-sections  in  line-drawings. 
Neither  have  we  analyzed  the  robustness  of  the 
conclusions  which  may  be  drawn  from  such 
instances,  i.e.,  how  sensitive  are  the  surface  predic¬ 
tions  to  errors  in  the  detection  and  parameteriza¬ 
tion  of  curves  in  line-drawings.  Practical  considera¬ 
tions  demand  that  these  issues  be  addressed. 


Appendix 

We first show that the orthographic projection
of a quadric surface is always a quadratic
curve. To see this, it is sufficient to verify that the
depth discontinuity curves on quadrics are always
planar. Let us assume the projection geometry of
Fig. 3. Any quadric can always be considered the
zero potential surface of some second order potential
field,

$$
C(x, y, z) = ax^2 + by^2 + cz^2 + dxy + exz + fyz + gx + hy + iz + j .
$$

Then  the  normal  to  the  quadric,  C(x,y,z)  =  0,  at 
any  point  is  in  the  direction  of  the  potential  gra¬ 
dient  at  that  point.  Now  observe  that  a  point  on 
the  surface  is  a  depth  discontinuity  if  and  only  if 
the  surface  normal  at  that  point  is  orthogonal  to 
the  projection  direction,  i.e.,  the  z-axis  here. 
Hence, the depth discontinuity curve must satisfy
the following equation.

$$
\frac{\partial C(x, y, z)}{\partial z} = 2cz + ex + fy + i = 0
$$

This is the equation of a plane. The orthographic
projection of our quadric, C(x, y, z) = 0, can be
simply obtained by substituting for z from the
equation of the plane. The resulting curve is
immediately seen to be a quadratic.
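The elimination of z described above can be carried out mechanically. The following Python sketch is an illustration added here, not code from the paper. It assumes projection along the z-axis and a nonzero $z^2$ coefficient c, so that the plane equation can be solved for z; the function names `project_quadric` and `conic_type` are hypothetical. Substituting $z = -(ex + fy + i)/(2c)$ into C gives the terms of C without z, minus $(ex + fy + i)^2/(4c)$.

```python
from fractions import Fraction as F

def project_quadric(q):
    """Outline conic of a quadric under orthographic projection along z.

    q = (a, b, c, d, e, f, g, h, i, j) are the coefficients of
    C = a x^2 + b y^2 + c z^2 + d xy + e xz + f yz + g x + h y + i z + j.
    Returns (A, B, C2, D, E, K) for A x^2 + B xy + C2 y^2 + D x + E y + K = 0.
    Assumption of this sketch: c != 0.
    """
    a, b, c, d, e, f, g, h, i, j = (F(v) for v in q)
    if c == 0:
        raise ValueError("this sketch needs c != 0")
    return (a - e * e / (4 * c),   # x^2
            d - e * f / (2 * c),   # xy
            b - f * f / (4 * c),   # y^2
            g - e * i / (2 * c),   # x
            h - f * i / (2 * c),   # y
            j - i * i / (4 * c))   # constant

def conic_type(conic):
    """Classify by the sign of B^2 - 4 A C2 (degenerate cases not checked)."""
    A, B, C2 = conic[:3]
    disc = B * B - 4 * A * C2
    if disc < 0:
        return "ellipse"
    return "parabola" if disc == 0 else "hyperbola"

examples = {
    "ellipsoid x^2/4 + y^2 + z^2/9 = 1": (F(1, 4), 1, F(1, 9), 0, 0, 0, 0, 0, 0, -1),
    "one-sheet hyperboloid x^2 + y^2 - z^2 = 1": (1, 1, -1, 0, 0, 0, 0, 0, 0, -1),
    "one-sheet hyperboloid x^2 - y^2 + z^2 = 1": (1, -1, 1, 0, 0, 0, 0, 0, 0, -1),
    "elliptic paraboloid x = y^2 + z^2": (0, 1, 1, 0, 0, 0, -1, 0, 0, 0),
}
for name, q in examples.items():
    print(name, "->", conic_type(project_quadric(q)))
```

The two hyperboloids of one sheet project onto an ellipse and a hyperbola respectively, in line with the classification given in Section V.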

Now  we  show  that  if  we  arbitrarily  fix  the 
depth  of  the  quadric  surface  from  the  projection 
plane  then  the  surface  is  completely  determined,  up 
to three degrees of freedom, by the projected quadratic.
Let us assume that the equation of the
plane containing the space curve is
$z + \alpha x + \beta y = 0$. This equation has two degrees
of freedom. Constraining this plane to pass
through the origin fixes the depth of the quadric
surface. Equating the coefficients of this plane with
those of the plane derived above, we get
$e = 2\alpha c$, $f = 2\beta c$, $i = 0$. Now, if we substitute
for z into C(x, y, z) = 0, we will get

$$
(a - \alpha^2 c)x^2 + (b - \beta^2 c)y^2
+ (d - 2\alpha\beta c)xy + gx + hy + j = 0 .
$$

But  as  we  know  the  equation  of  the  projected  qua¬ 
dratic,  we  can  determine  all  the  coefficients  of  the 
quadric  C(x,  y,  z)  =  0  in  terms  of  one  unknown  c. 
This  unknown  determines  the  scaling  factor  of  the 
radii  of  normal  curvature  along  the  depth  discon¬ 
tinuity  curve  on  the  surface  (see  Section  V). 
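The three-parameter family can be made concrete by inverting the coefficient relations above. The following Python sketch is an illustration added here, not code from the paper. Given a projected conic and a choice of $\alpha$, $\beta$, and c, it builds the corresponding quadric and checks, at sample points of the conic lifted onto the plane $z + \alpha x + \beta y = 0$, that the point lies on the quadric and that $\partial C/\partial z$ vanishes there. The function names and the sample values are hypothetical.

```python
from fractions import Fraction as F

def quadric_from_conic(conic, alpha, beta, c):
    """Quadric (a, b, c, d, e, f, g, h, i, j) whose outline along z is `conic`.

    conic = (A, B, C2, D, E, K) for A x^2 + B xy + C2 y^2 + D x + E y + K = 0.
    The contour plane is z + alpha*x + beta*y = 0; c is the z^2 coefficient.
    """
    A, B, C2, D, E, K = (F(v) for v in conic)
    alpha, beta, c = F(alpha), F(beta), F(c)
    return (A + alpha**2 * c,          # a, from A  = a - alpha^2 c
            C2 + beta**2 * c,          # b, from C2 = b - beta^2 c
            c,
            B + 2 * alpha * beta * c,  # d, from B  = d - 2 alpha beta c
            2 * alpha * c,             # e
            2 * beta * c,              # f
            D, E,                      # g, h
            F(0),                      # i = 0: plane passes through the origin
            K)                         # j

def evaluate(q, x, y, z):
    a, b, c, d, e, f, g, h, i, j = q
    return (a*x*x + b*y*y + c*z*z + d*x*y + e*x*z + f*y*z
            + g*x + h*y + i*z + j)

def dC_dz(q, x, y, z):
    a, b, c, d, e, f, g, h, i, j = q
    return 2*c*z + e*x + f*y + i

# Unit circle x^2 + y^2 - 1 = 0 with an arbitrary choice of the free parameters.
alpha, beta, c = F(1, 2), F(-1, 3), F(2)
q = quadric_from_conic((1, 0, 1, 0, 0, -1), alpha, beta, c)
for x, y in [(F(1), F(0)), (F(0), F(1)), (F(3, 5), F(4, 5))]:
    z = -alpha * x - beta * y          # lift the image point onto the plane
    print(evaluate(q, x, y, z), dC_dz(q, x, y, z))
```

Exact rational arithmetic is used so that the two printed quantities are exactly zero rather than zero up to rounding.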

Acknowledgement 

The  author  is  deeply  indebted  to  Raz  Stowe 
for  substantial  mathematical  assistance.  He  also 
wishes  to  thank  Tom  Binford  for  his  support.  This 
work  was  supported  in  part  by  the  Defense 
Advanced  Research  Projects  Agency  under  contract 
N00039-84-C-0211.


Since submitting the paper, the author
has discovered that the proof of Theorem 6
is incorrect. A correct proof will be
presented in a subsequent version.


References

Blaschke, W., Vorlesungen über Differentialgeometrie
II, Springer, Berlin, 1923.

Coddington, E.A., An Introduction to Ordinary
Differential Equations, Prentice-Hall, New Jersey,
1961.

Coxeter, H.S.M., Introduction to Geometry, John
Wiley & Sons, New York, 1961.

Golubitsky, M., and Guillemin, V., Stable Mappings
and Their Singularities, Springer-Verlag,
New York, 1973.

Guillemin, V., and Pollack, A., Differential Topology,
Prentice-Hall, New Jersey, 1974.

Hilbert, D., and Cohn-Vossen, S., Geometry and
the Imagination, Chelsea, New York, 1952.

Koenderink, J.J., “What does the occluding contour
tell us about solid shape?,” Perception, Vol.
13, 1984, 321-330.

Lipschutz, M.M., Differential Geometry, McGraw-Hill,
New York, 1969.

Maschke, H., “On Superosculating Quadric Surfaces,”
Transactions of the American Mathematical
Society, Vol. 3, 1902, 482-484.

Olmsted, J.M.H., Solid Analytic Geometry,
Appleton-Century-Crofts, New York, 1947.

Strang, G., Linear Algebra and Its Applications,
Academic Press, New York, 1976.

Whitney, H., “On Singularities of Mappings of
Euclidean Spaces. I. Mappings of the Plane into
the Plane,” Annals of Mathematics, Vol. 62, Nov.
1955, 374-410.


Fig. 1. Line-Drawing with Various Types of Edges

Fig. 2. A Depth Discontinuity with a Continuous Tangent in a
Line-Drawing need not have a Continuous Tangent in Space

Fig. 3. The Projection Geometry

Fig. 4. Change in Depth Discontinuity under
Perturbation of Viewpoint


Line-Drawing Interpretation :
Bilateral Symmetry

Vic Nalwa

A.I. Lab., Stanford University, CA 94305


Abstract 

Symmetry  manifests  itself  in  numerous  forms 
throughout  nature  and  the  man-made  world.  It  is 
frequently  encountered  in  line-drawings  correspond¬ 
ing  to  edges  in  images.  We  seek  to  discover  con¬ 
straints  on  the  imaged  surfaces  from  instances  of 
bilateral  (reflective)  symmetry  in  line-drawings 
obtained  from  orthographically  projected  images.  A 
general  viewpoint  is  assumed,  i.e.,  it  is  assumed  that 
the  mapping  of  the  viewed  surface  onto  the  line¬ 
drawing  is  stable  under  perturbation  of  the 
viewpoint  within  some  open  set  on  the  Gaussian 
sphere.  It  is  shown  that  the  only  planar  figures 
which  exhibit  bilateral  symmetry  under  orthographic 
projection  with  a  general  viewpoint  are  ellipses  and 
straight-line  segments. 

We show that the orthographic projection of a
surface  of  revolution  exhibits  bilateral  symmetry 
about  the  projection  of  the  axis  of  revolution, 
irrespective  of  the  viewing  direction.  Further,  it  is 
shown  that  whenever  the  symmetry  axis  is  the  pro¬ 
jection  of  a  line  in  space  which  is  invariant  under 
perturbation  of  the  viewpoint,  the  bilaterally  sym¬ 
metric  line-drawing  is  the  orthographic  projection  of 
a  local  surface  of  revolution.  The  axis  of  revolution 
is  the  back-projection  of  the  symmetry  axis.  Vari¬ 
ous  line-drawing  configurations  for  which  the  back- 
projection  is  invariant  are  detailed.  It  is  conjec¬ 
tured  that  isolated  ellipses  and  isolated  straight-line 
segments  are  the  only  bilaterally  symmetric  line- 
drawings  whose  symmetry  axes  are  not  projections 
of  lines  in  space  which  are  invariant  under  pertur¬ 
bation  of  the  viewpoint.  Throughout,  only  surface 
regions  local  to  the  scene-events  corresponding  to 
the  lines  are  constrained. 


I.  Introduction 

Line-drawings  are  representations  of  edges  in 
an  image.  An  edge  in  an  image  may  be  caused  by 
various  events  in  the  scene.  The  scene-event  may 
be  a  surface-tangent  discontinuity,  a  depth  discon¬ 
tinuity,  a  surface-reflectance  discontinuity  or  an 
illumination  discontinuity  (shadow).  Fig.  1  illus¬ 
trates  the  various  cases.  Notice  that  a  depth 
discontinuity  may  also  simultaneously  be  a 
surface-tangent  discontinuity.  We  distinguish  this 
from  a  continuous-surface-tangent  depth  discon¬ 
tinuity.  While  surface-tangent  discontinuities, 
illumination  discontinuities,  and  surface-reflectance 
discontinuities  constitute  viewpoint-independent 
scene-edges,  continuous-surface-tangent  depth 
discontinuities  are  viewpoint-dependent. 

An  object  exhibits  symmetry  whenever  it  can 
be  divided  into  two  or  more  parts  which  can  be 
permuted  by  the  application  of  certain  isometries 
which  leave  the  original  object  unchanged.  Sym¬ 
metry  manifests  itself  in  numerous  forms 
throughout  nature  and  the  man-made  world.  Some 
examples  are  shown  in  Fig.  2.  The  two  symmetry 
operations  for  planar  figures  are  reflection  and  rota¬ 
tion.  The  letter  A  is  an  instance  of  a  figure  with 
reflective  symmetry  and  the  letter  Z  exhibits  rota¬ 
tional  symmetry.  The  letter  H  admits  both  sym¬ 
metry  operations.  Three  dimensional  space  admits 
several  symmetry  operations  in  addition  to 
reflection  and  rotation,  e.g.,  inversion  symmetry  is 
exhibited  by  a  pair  of  bicycle  cranks.  In  this 
presentation  we  are  concerned  with  the  three 
dimensional  implications  of  bilateral  (reflective) 
symmetry  in  a  line-drawing. 

Line-drawings, like images, are many-to-one
mappings from a 3-D domain to a 2-D range.
Although the depth information is lost under projection,
we can nevertheless impose some constraints
straints  on  the  3-D  scene  by  investigating  the  2-D 
configuration  of  the  corresponding  line-drawing. 
We  restrict  our  attention  to  orthographic  projection 
here.  Orthographic  projection  is  a  reasonable 
approximation  to  perspective  projection  whenever 
the  angle  subtended  by  the  object  at  the  viewpoint 
is small. Orthographically projected line-drawings
of  man-made  scenes  often  exhibit  instances  of  bila¬ 
teral  symmetry.  We  investigate  constraints 
imposed  on  the  scene  by  such  instances  under  the 
assumption  of  general  viewpoint,  i.e.,  under  the 
assumption  that  the  mapping  of  the  viewed  surface 
onto  the  line-drawing  is  stable  under  perturbation 
of the viewpoint within some open set on the Gaussian
sphere¹. The stability we demand includes the
requirement that the bilateral symmetry exhibited
by the line-drawing be preserved under perturbation
of viewpoint. Other properties of the mapping
with respect to which we require stability will be
mentioned as we proceed. As long as the total
number of such requirements is finite, the set of
viewing directions on the Gaussian sphere where
the general viewpoint assumption is invalid is
closed and has measure zero². It must be
emphasized here that we are not seeking ad hoc
heuristics. We seek to discover rigorous constraints
under specific reasonable assumptions. It should be
noted that, given the loss of information about the
third dimension under projection, one cannot hope
to make any substantial deductions about the
imaged surfaces without the general viewpoint
assumption.

¹ The Gaussian sphere is a device to represent orientation
in space. Each point on a unit sphere, called the
Gaussian sphere (see [Hilbert and Cohn-Vossen, 1952]),
corresponds to the unit vector from the center of the
sphere to that point.

² An arbitrary set in any manifold has measure zero if
its preimage, for every local parametrization, can be
covered by a countable number of rectangular solids with
an arbitrarily small total volume in Euclidean space. For
our purposes it suffices to note that the Gaussian sphere is
a two dimensional manifold and consequently all sets of
points and lines on it have measure zero. The null set of
course has measure zero. On the other hand, no open set
on the Gaussian sphere has measure zero.

It can be shown that the union of a countable
number of sets of measure zero is also of measure zero (see
[Guillemin and Pollack, 1974]). Therefore, the union of a
finite number of closed sets of measure zero is closed and
of measure zero. The complement of such a union is open
and of full measure.

Related  previous  work  of  interest  includes 
that of Kanade [Kanade, 1981] and Marr [Marr,
1977]. Kanade presents a heuristic which infers
symmetry  in  the  scene  whenever  skewed  symmetry 
is  discovered  in  an  orthographically  projected  line¬ 
drawing  of  a  world  containing  only  planar  surfaces. 
Skewed  symmetry  is  meant  to  indicate  planar 
shapes  in  which  symmetry  is  exhibited  along  lines 
at  a  fixed  angle  to  the  axis  of  symmetry,  e.g.,  see 
Fig.  3.  Of  course,  if  this  fixed  angle  is  a  right  angle 
then  we  have  bilateral  symmetry.  Although  planar 
bilateral  symmetries  do  map  onto  skewed  sym¬ 
metries,  skewed  symmetries  also  map  onto  skewed 
symmetries.  Hence,  Kanade’s  inference  requires 
skewed  symmetries  to  be  excluded  from  his  world 
of  planar  surfaces.  Marr  [Marr,  1977]  uses  “qualita¬ 
tive  symmetry”  to  deduce  generalized  cones,  i.e., 
solids  whose  cross-sections  perpendicular  to  an  axis 
are  all  scaled  versions  of  one  another.  His  concept 
of  qualitative  symmetry,  although  more  relaxed 
than  our  concept  of  bilateral  symmetry,  does  not 
lead  to  the  conclusions  he  wishes  to  draw  under  the 
general  viewpoint  paradigm  presented  here.  Our 
work is more in the spirit of Koenderink and van
Doorn, who have shown in two elegant articles that
the Gaussian curvature of a body can be deduced
from the curvature of the occluding contour [Koenderink,
1984] and that terminations of occluding
contours are always concave [Koenderink and van
Doorn, 1982].

In  Section  II  we  show  that  the  line-drawing  of 
an  orthographically  projected  surface  of  revolution 
exhibits  bilateral  symmetry  about  the  projection  of 
the  axis  of  revolution,  irrespective  of  the  viewing 
direction.  Section  III  investigates  whether  local 
surfaces  of  revolution  are  necessary  for  bilateral 
symmetry in a line-drawing obtained under general
orthographic  projection.  It  is  shown  that  whenever 
the  symmetry  axis  is  the  projection  of  a  line  in 
space  which  is  invariant  under  perturbation  of  the 
viewpoint, the bilaterally symmetric line-drawing is
the orthographic projection of a local surface of
revolution.  The  axis  of  revolution  is  the  back- 
projection  of  the  symmetry  axis.  Various  line¬ 
drawing  configurations  for  which  the  back- 
projection  is  invariant  are  detailed.  It  is  conjec¬ 
tured  that  isolated  ellipses  and  isolated  straight-line 
segments  are  the  only  bilaterally  symmetric  line- 
drawings  whose  symmetry  axes  are  not  projections 
of  lines  in  space  which  are  invariant  under  pertur¬ 
bation  of  the  viewpoint.  Throughout,  only  surface 
regions  local  to  the  scene-events  corresponding  to 
the  lines  are  constrained.  We  conclude  with  Sec¬ 
tion  IV. 

The  Appendix  contains  a  proof  for  the  propo¬ 
sition  that  segments  of  straight-lines  and  conic- 
sections  are  the  only  planar  curves  whose  ortho¬ 
graphic  projections  exhibit  local  bilateral  symmetry 
under  general  viewpoint.  It  immediately  follows 
that  ellipses  and  straight-line  segments  are  the  only 
finite  planar  curves  whose  orthographic  projections 
exhibit  (global)  bilateral  symmetry  under  general 
viewpoint. 

II.  Surface  of  Revolution  :  A  Sufficient 
Condition. 

It  is  easy  to  see  that  an  object  displaying  bila¬ 
teral  symmetry  in  space  need  not  exhibit  bilateral 
symmetry  under  orthographic  projection  (e.g.,  Fig. 
2-c).  Fig.  3  shows  that  bilaterally  symmetric 
planar  curves,  in  general,  project  to  what  Kanade 
[Kanade,  1981]  calls  skewed  symmetric  contours. 
As  shown  in  the  Appendix,  ellipses  and  straight-line 
segments  are  the  only  exceptions  to  this 
phenomenon  in  that  they  project  to  bilaterally  sym¬ 
metric  curves. 
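The skewing of a planar bilateral symmetry under projection can be quantified with a short calculation. The following Python sketch is an illustration added here, not taken from the paper or from Kanade's work. It assumes projection along the z-axis, a figure lying in the x-y plane rotated by `tilt` about the x-axis, and a symmetry axis at in-plane angle `phi` from the x-axis; the function name `skew_angle_deg` is hypothetical. It returns the angle in the image between the projected symmetry axis and the projected transverse direction, which is 90 degrees only for bilateral symmetry.

```python
import math

def skew_angle_deg(phi, tilt):
    """Image angle (degrees) between the projected symmetry axis and the
    projected direction that was perpendicular to it in the figure's plane.

    In-plane basis: p1 = (1, 0, 0), p2 = (0, cos tilt, sin tilt).
    Axis s = cos(phi) p1 + sin(phi) p2; transverse t = -sin(phi) p1 + cos(phi) p2.
    Orthographic projection along z drops the third coordinate.
    """
    s = (math.cos(phi), math.sin(phi) * math.cos(tilt))
    t = (-math.sin(phi), math.cos(phi) * math.cos(tilt))
    dot = s[0] * t[0] + s[1] * t[1]
    return math.degrees(math.acos(dot / (math.hypot(*s) * math.hypot(*t))))

print(skew_angle_deg(0.7, 0.0))                   # plane parallel to the image
print(skew_angle_deg(math.pi / 4, math.pi / 3))   # tilted plane, oblique axis
```

With the plane parallel to the image the angle stays at 90 degrees. For a 60 degree tilt and an axis at 45 degrees in the plane, the cosine of the image angle is -0.6, so the angle is about 126.87 degrees and the symmetry in the image is skewed.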

We  now  seek  to  argue  that  the  orthographic 
projection  of  any  surface  of  revolution,  i.e.,  a  sur¬ 
face  generated  by  revolving  a  curve  (called  the  gen¬ 
erator)  about  an  axis,  will  always  exhibit  bilateral 
symmetry  about  the  projection  of  the  axis  of  revo¬ 
lution.  Consider  such  a  surface  to  have  its  axis 
aligned  with  an  arbitrary  vector  in  space.  Fig.  4 
shows  one  such  vector.  Its  orientation  is  com¬ 
pletely  determined  by  the  specification  of  its 
azimuth  angle,  i.e.,  the  counter-clockwise  angle 
made by its projection onto the x-y plane with the
x axis, and its polar angle, i.e., the angle it makes
with  the  z  axis.  It  is  easily  seen  that  the  symmetry 
properties  of  the  projection  onto  the  x-y  plane  are 
not  affected  by  the  azimuth  angle.  Hence,  without 
loss  of  generality  we  can  always  consider  it  to  be 
zero.  Now,  if  we  note  that  the  z-x  plane  divides 
the  surface  of  revolution  into  two  mirror  images,  we 
can  immediately  deduce  that  the  orthographic  pro¬ 
jection  will  exhibit  bilateral  symmetry  about  the  x 
axis.  Fig.  5  shows  an  example  of  the  orthographic 
projection  of  a  surface  of  revolution.  The  mirror 
line  is  the  projection  of  the  axis  of  revolution. 
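
The symmetry argument above can be checked numerically. The following Python sketch is an illustration only and is not part of the original analysis: the vase-like generator, the function names and the sample polar angles are assumptions made for the example. It tilts the axis of revolution in the z-x plane (azimuth zero), projects along the z axis, and measures how far the image of the surface point at revolution angle phi lies from the mirror image, about the x axis, of the surface point at -phi.

```python
import math

def revolve_point(r, h, phi, polar):
    """Point on a surface of revolution whose axis lies in the z-x plane.

    r, h  : radius and height of a generator sample, measured from the axis
    phi   : angle of revolution about the axis
    polar : polar angle of the axis (its angle with the z axis); azimuth is 0
    """
    # Surface point in a frame whose third axis is the axis of revolution.
    px, py, pz = r * math.cos(phi), r * math.sin(phi), h
    # Rotate about the y axis so that the axis tilts within the z-x plane.
    x = px * math.cos(polar) + pz * math.sin(polar)
    y = py
    z = -px * math.sin(polar) + pz * math.cos(polar)
    return x, y, z

def project(point):
    """Orthographic projection along the z axis onto the x-y plane."""
    x, y, _ = point
    return x, y

def max_mirror_error(generator, polar, n_phi=90):
    """Largest mismatch between the image at phi and the mirrored image at -phi."""
    worst = 0.0
    for r, h in generator:
        for k in range(n_phi):
            phi = 2.0 * math.pi * k / n_phi
            x1, y1 = project(revolve_point(r, h, phi, polar))
            x2, y2 = project(revolve_point(r, h, -phi, polar))
            # Mirror symmetry about the x axis: same x, opposite y.
            worst = max(worst, abs(x1 - x2), abs(y1 + y2))
    return worst

# Hypothetical vase-like generator given as (radius, height) samples.
generator = [(1.0 + 0.4 * math.sin(3.0 * 0.1 * i), 0.1 * i) for i in range(30)]
errors = [max_mirror_error(generator, math.radians(a)) for a in (10, 35, 60, 85)]
```

Under these assumptions every entry of `errors` should vanish up to rounding, whatever the polar angle, because the z-x plane pairs each surface point with its mirror image and both project to the same x and opposite y.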

The  above  analysis  assumes  that  the  surface 
of  revolution  is  not  obscured  by  some  other  object 
and  that  it  has  no  shadow  across  it.  Also,  surface 
marks,  if  any,  must  be  uniform  over  sub-surfaces 
which  themselves  are  surfaces  of  revolution  with 
the  same  axis  as  the  complete  surface. 

III.  Surface  of  Revolution  :  A  Necessary 
Condition? 

Is  a  surface  of  revolution  a  necessary  condi¬ 
tion  for  a  bilaterally  symmetric  orthographic  pro¬ 
jection  under  general  viewpoint?  Obviously  not. 
We  can  imagine  arbitrary  surfaces,  without  any 
scene-edges,  corresponding  to  regions  of  a  sym¬ 
metric  line-drawing  which  are  “distant”  from  the 
lines.  These  surfaces  can  always  be  chosen  such 
that  small  arbitrary  perturbations  of  the  viewpoint 
do  not  violate  bilateral  symmetry  in  the  projection. 
As  no  information  about  these  regions,  other  than 
the  absence  of  edges,  is  available  in  the  line¬ 
drawing,  it  is  not  possible  to  deduce  stronger  con¬ 
straints  for  such  regions  without  making  assump¬ 
tions  regarding  the  nature  of  the  imaged  world. 
Therefore,  we  restrict  our  attention  to  the  regions 
of  surfaces  local  to  the  edges. 

We  now  argue  that  whenever  the  symmetry 
axis  is  the  projection  of  a  line  in  space  which  is 
invariant  under  perturbation  of  the  viewpoint,  the 
bilaterally  symmetric  line-drawing  is  the  ortho¬ 
graphic  projection  of  a  local  surface  of  revolution. 
The  axis  of  revolution  is  the  back-projection  of  the 
symmetry  axis.  Let  us  assume  that  the  back- 
projection  of  the  symmetry  axis  is  oriented  along 
the vector shown in Fig. 4 and that once again the
orthographic projection is along the z-axis and that
the  azimuth  angle  is  zero.  Now  we  point  out  that 
the  general  viewpoint  assumption  can  be  translated 
to  the  requirement  that  bilateral  symmetry  be 
preserved  under  perturbation  of  the  imaged  surface 
position.  It  is  sufficient  to  consider  combinations  of 
two  such  perturbations  :  one,  rotation  of  the  sur¬ 
face  about  the  y-axis  —  we  call  this  tilting,  and 
two,  rotation  of  the  surface  about  the  back- 
projection  of  the  symmetry  axis  —  we  call  this  sim¬ 
ply  rotation. 

First,  we  argue  that  if  the  orthographic  pro¬ 
jection  of  the  viewed  surface  is  to  remain  bila¬ 
terally  symmetric  under  tilting,  then  the  surface 
must  exhibit  local  bilateral  symmetry  in  space 
about  the  x-z  plane.  To  see  this,  consider  the  inter¬ 
sections  of  the  viewed  surface  with  any  two  planes 
parallel  to  the  x-z  plane  and  passing  through  a  pair 
of  bilaterally  symmetric  points  in  the  line-drawing. 
Every  such  pair  of  intersection  curves  must  be 
locally  symmetric  about  the  x-z  plane.  If  the 
planes  do  not  intersect  the  viewed  surface,  but  are 
rather  tangential  to  it,  then  the  two  tangent  points 
or  straight-line  segments  must  also  be  bilaterally 
symmetric.  Hence,  the  viewed  surface  must  exhibit 
local  bilateral  symmetry.  Further,  because  we 
could  always  conceive  of  a  surface  perturbation 
which  is  a  rotation  followed  by  tilting,  we  require 
that  the  surface  maintain  its  local  bilateral  sym¬ 
metry  about  the  x-z  plane  when  it  is  subject  to 
rotation.  But  this  requirement  can  only  be  satisfied 
by  local  surfaces  of  revolution  about  the  back- 
projection  of  the  symmetry  axis.  To  see  this,  con¬ 
sider  the  following  argument.  For  a  surface  to  be 
bilaterally  symmetric  about  a  plane,  its  intersection 
with  any  plane  orthogonal  to  the  symmetry  plane 
must  also  exhibit  bilateral  symmetry.  In  particular, 
consider  the  planes  which  have  their  normals 
oriented  along  the  back- projection  of  the  symmetry 
axis.  The  requirement  that  bilateral  symmetry  of 
the  surface  be  maintained  under  rotation  is 
equivalent  to  the  requirement  that  all  the  described 
intersections  maintain  their  symmetry  in  space.  As 
the  only  such  planar  curves  are  circles,  the  argu¬ 
ment  is  complete. 


The  result  just  derived  is  of  practical 
significance  only  if  we  have  mechanisms  to  deduce 
invariance  of  the  back-projections  of  the  symmetry 
axis from the line-drawing itself. As we will use
results derived in [Nalwa, 1987] for this purpose, we
need to make the same assumptions about the
viewed surfaces as are made there. In particular,
we assume that the projected surfaces are piecewise
$C^3$, i.e., the surfaces comprise finitely many $C^3$
patches, each bounded by a finite piecewise $C^3$
curve. This restriction does not significantly constrain
the domain as long as we do not require a
priori knowledge of the $C^3$ surface-patch boundaries.

The piecewise $C^3$ assumption enables us to
draw  the  following  conclusions  (see  [Nalwa,  1987]). 
A  straight-line  in  a  line-drawing  is  either  the  pro¬ 
jection  of  a  straight  viewpoint-independent  edge  or 
of  a  straight  viewpoint-dependent  edge.  If  the  edge 
is  viewpoint-dependent  then  it  can  be  locally 
described  by  a  quadric  cylinder  or  a  quadric  cone. 
An  elliptical  segment  in  a  line-drawing  is  either  the 
projection  of  an  elliptical  viewpoint-independent 
edge  or  of  an  elliptical  viewpoint-dependent  edge. 
If  the  edge  is  viewpoint-dependent  then  it  can  be 
locally  described  by  an  ellipsoid  or  a  hyperboloid  of 
one  sheet.  These  conclusions  hold  even  if  the 
straight-line  and  elliptical  segments  in  the  line- 
drawing  are  fragmented. 

Now  we  detail  various  line-drawing 
configurations  for  which  the  back-projection  of  the 
symmetry  axis  is  invariant  under  perturbation  of 
the  viewpoint.  Case  I.  Consider  a  bilaterally  sym¬ 
metric  line-drawing  which  has  two  parallel 
straight-line  segments,  one  on  either  side  of  the 
symmetry  axis.  For  stable  bilateral  symmetry  the 
two  straight  scene-edges  must  be  viewpoint- 
dependent  and  further  must  locally  lie  on  a  single 
right-circular  cylinder.  The  back-projection  of  the 
symmetry  axis  is  the  axis  of  this  cylinder.  Case  II. 
Consider  a  bilaterally  symmetric  line-drawing 
which  has  two  non-parallel  straight-line  segments, 
one  on  either  side  of  the  symmetry  axis.  For  stable 
bilateral  symmetry  the  two  straight  scene-edges 
must be viewpoint-dependent and further must
locally lie on a single right-circular cone. The
back-projection of the symmetry axis is the axis of
this cone. Case III. Consider a bilaterally symmetric
line-drawing which has an elliptical segment
(may  be  fragmented)  bisected  by  the  symmetry 
axis.  It  is  easy  to  see  that  the  back-projection  of 
the  centroid  of  the  (completed)  ellipse  must  con¬ 
tinue  to  lie  on  the  back-projection  of  the  symmetry 
axis  under  perturbation  of  the  viewpoint.  Simi¬ 
larly,  a  tangent-discontinuity  on  the  symmetry  axis 
in  a  line-drawing  must  be  the  projection  of  a  coni¬ 
cal  tip  in  space  which  continues  to  lie  on  the  back- 
projection  of  the  symmetry  axis  under  perturbation 
of  the  viewpoint.  As  any  two  points  in  space  deter¬ 
mine  a  line,  two  independent  events  from  among 
ellipses  bisected  by  the  symmetry  axis  and 
tangent-discontinuities  on  the  symmetry  axis  fix  the 
back-projection  of  the  symmetry  axis.  Further 
investigation  would  almost  certainly  lead  to  more 
relaxed  conditions  for  the  invariance  of  the  back- 
projection  of  the  symmetry  axis.  Fig.  6  provides 
several  examples  of  bilaterally  symmetric  line- 
drawings  for  which  we  can  presently  deduce  local 
surfaces  of  revolution. 

It  is  our  conjecture  that  isolated  ellipses  and 
isolated  straight-line  segments  are  the  only  bila¬ 
terally  symmetric  line-drawings  whose  symmetry 
axes  are  not  projections  of  lines  in  space  which  are 
invariant  under  perturbation  of  the  viewpoint. 
What  we  do  know  is  that  ellipses  and  straight-line 
segments  are  the  only  planar  figures  which  exhibit 
bilateral  symmetry  under  orthographic  projection 
with  a  general  viewpoint  (see  Appendix).  Ellipses 
project  onto  ellipses  and  straight-line  segments  onto 
straight-line  segments.  It  follows  that  all 
viewpoint-independent  scene-edges  which  are  iso¬ 
lated  ellipses  or  isolated  straight-line  segments  exhi¬ 
bit  bilateral  symmetry  under  projection.  These,  of 
course,  are  not  local  surfaces  of  revolution.  The 
viewpoint-dependent  scene-edges  which  project 
onto  isolated  ellipses  are  either  local  ellipsoids  or 
local hyperboloids of one sheet (see [Nalwa, 1987]).
These, in general, are also not local surfaces of
revolution. From the result in [Koenderink and
van Doorn, 1982], that terminations of occluding
contours are always concave, it follows that no
viewpoint-dependent scene-edge may project onto
an isolated straight-line segment.

IV.  Conclusion 

Line-drawing  interpretation  refers  to  the 
attempt  to  infer  3-D  shape  information  about  the 
scene  from  its  line-drawing.  This  line  of  work  has  a 
long  tradition  in  both  computer  vision  (see  [Malik, 
1985]) and psychology (e.g., [Attneave and Frost,
1969]).  Most  of  the  work  in  computer  vision  has 
concentrated  on  exhaustively  listing  the  junction 
possibilities  in  a  polyhedral  world,  although 
recently  there  have  been  attempts  to  handle  simple 
curved  objects  too  (e.g.,  [Malik,  1985]).  Relatively 
few  attempts  have  been  made  at  exploiting  the 
structure  of  curves  in  line-drawings  (e.g.,  [Lowe  and 
Binford,  1984]). 

We  view  the  line-drawing  interpretation  prob¬ 
lem  to  be  one  of  constraint-specification,  i.e.,  gen¬ 
eration  of  all  possible  constraints  which  can  be 
deduced  from  a  line-drawing  under  the  general 
viewpoint  assumption.  This,  in  our  view,  should  be 
followed  by  an  attempt  to  seek  3-D  spatial 
configurations  compatible  with  the  specified  con¬ 
straints.  In  order  to  limit  the  possibilities  it  may 
be  necessary  to  introduce  some  restrictions  on  the 
world  at  this  stage  or  to  introduce  the  concept  of 
“simple”  and  then  try  to  find  the  “simplest”  con¬ 
sistent  interpretation. 

In  this  work  we  have  attempted  to  discover 
the  constraints  associated  with  bilateral  symmetry 
in  a  line-drawing.  It  was  shown  that  the  ortho¬ 
graphic  projection  of  a  surface  of  revolution  exhi¬ 
bits  bilateral  symmetry  about  the  projection  of  the 
axis  of  revolution,  irrespective  of  the  viewing  direc¬ 
tion.  Further,  it  was  shown  that  whenever  the 
symmetry  axis  is  the  projection  of  a  line  in  space 
which  is  invariant  under  perturbation  of  the 
viewpoint,  the  bilaterally  symmetric  line-drawing  is 
the  orthographic  projection  of  a  local  surface  of 
revolution.  The  axis  of  revolution  is  the  back- 
projection  of  the  symmetry  axis.  Various  line¬ 
drawing  configurations  for  which  the  back- 
projection  is  invariant  were  detailed.  It  was  conjec¬ 
tured  that  isolated  ellipses  and  isolated  straight-line 
segments are the only bilaterally symmetric line-
drawings whose symmetry axes are not projections
of  lines  in  space  which  are  invariant  under  pertur¬ 
bation  of  the  viewpoint.  Throughout,  only  surface 
regions  local  to  the  scene-events  corresponding  to 
the  lines  were  constrained. 

It  might  interest  the  reader  to  learn  that  this 
work was prompted by the observation in [Nalwa,
1987] that surfaces which orthographically project,
under  general  viewpoint,  onto  circles  in  line- 
drawings  can  be  described  locally  by  spheres. 

Finally,  we  note  that  we  have  made  no  effort 
whatsoever  to  suggest  schemes  for  the  detection  in 
line-drawings  of  bilateral  symmetry,  in  general,  and 
ellipses  and  straight-lines,  in  particular.  Neither 
have  we  analyzed  the  robustness  of  the  conclusions 
which  may  be  drawn  from  bilateral  symmetry  in 
line-drawings,  i.e.,  whether  approximate  symmetry 
translates  to  approximate  local  surfaces  of  revolu¬ 
tion.  Practical  considerations  demand  that  these 
issues  be  addressed. 

Appendix  :  Bilateral  Symmetry  in  Ortho¬ 
graphic  Projections  of  Planar  Curves 

Theorem  :  Segments  of  straight-lines  and  conic- 
sections,  i.e.,  ellipses,  parabolas,  and  hyperbolas, 
are the only planar curves whose orthographic projections
exhibit local bilateral (reflective) symmetry
under  general  viewpoint. 

Corollary  1  :  Curves  whose  orthographic  projec¬ 
tions  exhibit  local  bilateral  symmetry  under  general 
viewpoint  are  planar  if  and  only  if  their  projections 
are  conic-sections  or  straight-lines. 

Corollary  2  :  Ellipses  and  straight-line  segments 
are  the  only  finite  planar  curves  whose  orthographic 
projections  exhibit  bilateral  symmetry  under  general 
viewpoint. 

It  is  assumed  throughout  this  appendix  that 
the  curves  under  consideration  are  piecewise  ana¬ 
lytic,  i.e.,  the  curves  are  comprised  of  finitely  many 
analytic  segments.  This  restriction  does  not  in  any 
significant  way  constrain  the  domain  as  long  as  we 
do  not  require  a  priori  knowledge  of  the  analytic 
segment  end-points. 

Note that the theorem refers to local bilateral
symmetry, i.e., reflective symmetry exhibited on
some  open  interval  of  the  curve-segment.  Corollary 
2,  on  the  other  hand,  refers  to  symmetry  exhibited 
by  the  complete  curve.  It  is  assumed  in  the  state¬ 
ment  of  the  theorem  that  a  pair  of  straight  lines 
meeting  at  a  point  is  a  special  case  of  a  hyperbolic 
branch.  It  is  of  interest  to  note  that  the  ortho¬ 
graphic  projections  of  conic-sections  and  straight¬ 
lines  exhibit  global  bilateral  symmetry  under  every 
viewpoint,  not  just  some  general  viewpoint.  Corol¬ 
lary  1  follows  from  the  fact  that  the  orthographic 
projection  of  a  curve  under  general  viewpoint  is  a 
conic-section  or  straight-line  if  and  only  if  the  origi¬ 
nal  curve  is  a  conic-section  or  straight-line. 

It  is  our  conjecture  that  the  planarity  restric¬ 
tion  from  Corollary  2  may  be  removed,  i.e.,  we 
should  be  able  to  assert  that  ellipses  and  straight- 
line  segments  are  the  only  finite  space  curves  whose 
orthographic  projections  exhibit  bilateral  symmetry 
under  general  viewpoint.  However,  the  planarity 
restriction  cannot  be  removed  from  the  Theorem. 
The  circular  helix  is  an  example  of  a  non-planar 
space curve whose segments exhibit local bilateral
symmetry under orthographic projection with a
general  viewpoint. 
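
The helix example can be illustrated with a short numerical check. The sketch below is a hedged illustration: the unit-radius helix (cos t, sin t, pitch*t), the restriction of the viewing direction to the z-x plane, and all function names are assumptions introduced here. Because the helix maps onto itself under a rotation about its axis combined with a translation along it, fixing the azimuth of the viewing direction only fixes which point of the curve falls on the mirror line; with zero azimuth that point is t = pi/2.

```python
import math

def project_helix(t, pitch, theta):
    """Orthographic image of the helix point (cos t, sin t, pitch*t).

    The viewing direction is (sin theta, 0, cos theta); the image axes are
    e1 = (0, 1, 0) and e2 = (cos theta, 0, -sin theta).
    """
    x, y, z = math.cos(t), math.sin(t), pitch * t
    u = y
    v = x * math.cos(theta) - z * math.sin(theta)
    return u, v

def helix_mirror_error(pitch, theta, half_width=1.0, n=200):
    """Compare the image at pi/2 + d with the mirror image of the one at pi/2 - d."""
    t0 = math.pi / 2.0
    _, v0 = project_helix(t0, pitch, theta)  # the mirror line is v = v0
    worst = 0.0
    for i in range(1, n + 1):
        d = half_width * i / n
        u_a, v_a = project_helix(t0 + d, pitch, theta)
        u_b, v_b = project_helix(t0 - d, pitch, theta)
        # Reflection about v = v0 keeps u and negates (v - v0).
        worst = max(worst, abs(u_a - u_b), abs((v_a - v0) + (v_b - v0)))
    return worst
```

For any pitch and any polar angle theta of the viewing direction, `helix_mirror_error` should return a value at rounding level, which is consistent with the statement that segments of a non-planar curve can exhibit local bilateral symmetry under general viewpoint.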

Fig.  7  shows  a  curve-segment  whose  ortho¬ 
graphic  projection  under  general  viewpoint  is 
assumed  to  exhibit  local  bilateral  symmetry.  It  is 
further assumed that for some viewpoint, the
back-projection  of  the  mirror  axis  is  M.  It  inter¬ 
sects  the  curve  at  (x,y).  It  is  easy  to  see  that  the 
tangent,  T,  at  (x,y)  will  project  to  the  tangent  at 
the  projection  of  (x,y).  As  the  tangent  at  the  pro¬ 
jection  of  (x,y)  is  perpendicular  to  the  mirror  axis, 
any  line  parallel  to  this  tangent  will  intersect  (if  at 
all)  the  projected  curve  at  equal  distances  from  the 
mirror line. Because distances in any one direction
on the object plane are uniformly distorted under
projection, it follows that any line parallel to T will
intersect (if at all) the original curve at two points,
say $(x_1, y_1)$ and $(x_2, y_2)$, whose midpoint lies on
M.  This  is  a  necessary  and  sufficient  condition  for 
the  projected  curve  to  exhibit  local  bilateral  sym¬ 
metry  about  the  projection  of  M  for  some 
viewpoint. It has been assumed here that the
tangent, T, at (x,y) is well defined, i.e., the curve is
$C^1$ at (x,y). If this is not true, then the above condition
can be relaxed so that we require that every
line  parallel  to  some  arbitrary  line,  T’,  intersect  (if 
at  all)  the  original  curve  at  two  points  such  that 
their  midpoint  lies  on  M. 
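
The midpoint condition can be made concrete for an ellipse, for which it is the classical statement that the midpoints of parallel chords lie on a diameter. The sketch below is illustrative only; the semi-axes, the sampling of chords and the function names are assumptions. It takes M to be the line through the center and the curve point at parameter t0, takes T to be the tangent there, and reports how far a family of chords departs from being parallel to T and from having its midpoints on M.

```python
import math

def ellipse(t, a=3.0, b=1.5):
    """Point on the ellipse x = a cos t, y = b sin t."""
    return a * math.cos(t), b * math.sin(t)

def cross(p, q):
    """z component of the cross product of two planar vectors."""
    return p[0] * q[1] - p[1] * q[0]

def chord_check(t0, a=3.0, b=1.5, n=50):
    """Worst violations of the midpoint condition for chords around t0."""
    p = ellipse(t0, a, b)                              # M joins the origin to p
    tangent = (-a * math.sin(t0), b * math.cos(t0))    # direction of T at p
    worst_parallel = 0.0
    worst_midpoint = 0.0
    for i in range(1, n + 1):
        d = 1.2 * i / n
        p1, p2 = ellipse(t0 + d, a, b), ellipse(t0 - d, a, b)
        chord = (p1[0] - p2[0], p1[1] - p2[1])
        mid = ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0)
        # A zero cross product means the chord is parallel to T.
        worst_parallel = max(worst_parallel, abs(cross(chord, tangent)))
        # A zero cross product means the midpoint lies on M.
        worst_midpoint = max(worst_midpoint, abs(cross(mid, p)))
    return worst_parallel, worst_midpoint
```

Both returned values should be at rounding level for every t0, so each diameter of the ellipse can serve as the back-projection M of a mirror axis for a suitable viewpoint.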

Lemma  1  :  With  the  exception  of  a  pair  of 
straight-lines  meeting  at  a  point,  planar  curves 
whose  orthographic  projections  exhibit  local  bilateral 
symmetry  under  general  viewpoint  are  analytic  and 
without  inflections. 

Proof  :  Non-analytic  and  inflection  points  on  any 
curve  have  a  one-to-one  correspondence  with  non- 
analytic  and  inflection  points  on  the  projected 
curve  as  long  as  the  viewing  direction  does  not  lie 
in  the  plane  of  the  original  curve.  Such  viewing 
directions  are  excluded  by  the  general  viewpoint 
assumption.  For  the  projected  figure  to  exhibit 
bilateral  symmetry  we  must  be  able  to  pair  up 
non-analytic  points  and  inflection  points  such  that 
the pairs comprise mirror images under projection.
Every inflection must belong to one such
pair.  Non-analytic  points,  however,  may  also  lie  on 
the  mirror  axis  itself.  Bilateral  symmetry  requires 
that  the  mirror  axis  bisect  the  straight-lines  joining 
the  projections  of  the  paired  non-analytic 
(inflection) points. It follows that the back-
projection of the mirror axis must bisect the
straight-lines  joining  the  paired  non-analytic 
(inflection)  points  in  the  object  plane. 

It is easy to see that in the presence of non-analytic
(inflection) points, the back-projection of
the  mirror  axis,  if  any,  would  be  fixed  in  all  cases 
except  when  there  is  either  a  single  instance  of 
paired  non-analytic  (inflection)  points  or  an  isolated 
non-analytic  point  on  the  mirror  axis.  With  a  fixed 
mirror  axis,  the  curve  cannot  exhibit  bilateral  sym¬ 
metry  under  general  viewpoint  because  the  angles 
formed  at  the  intersection  of  the  curve  with  the 
mirror  axis  cannot  remain  symmetric  under  pertur¬ 
bation  of  any  viewpoint.  Fig.  8  illustrates  how  a 
bilaterally  symmetric  curve  with  inflections  and 
discontinuities  undergoes  distortion  and  loss  of 
symmetry  under  orthographic  projection. 

In case the back-projection of the mirror axis
is constrained to pass through the center of a line
joining  a  pair  of  non-analytic  (inflection)  points, 
symmetry  cannot  be  maintained  under  general 
viewpoint  because  of  the  following  argument.  The 
requirement  that  the  line  joining  the  pair  of  non- 
analytic  (inflection)  points  be  perpendicular  to  the 
mirror  axis  in  the  projected  plane  leads  to  the  con¬ 
straint  that  the  back-projection  of  the  mirror  axis 
bisect  all  straight-line  segments  parallel  to  the  line 
joining  the  pair  of  non-analytic  (inflection)  points 
and  bounded  by  the  original  curve.  But,  for  any 
given  curve,  this  constraint  fixes  the  back- 
projection  of  the  mirror  axis,  if  any.  Hence,  this 
configuration  cannot  exhibit  symmetry  under  gen¬ 
eral  viewpoint  either. 

In  case  of  an  isolated  non-analytic  point  on 
the  mirror  axis  we  first  note  that  the  non-analytic 
point  must  be  a  tangent  discontinuity.  If  this  is 
not  so,  the  necessary  and  sufficient  condition 
described  above  cannot  be  satisfied  under  perturba¬ 
tion  of  the  viewpoint.  If  the  isolated  discontinuity 
is  a  tangent  discontinuity  the  only  admissible  local 
curve  which  would  exhibit  bilateral  symmetry 
under  general  viewpoint  is  a  pair  of  straight-lines. 
It  follows  that  with  the  exception  of  a  pair  of 
straight-lines  meeting  at  a  point,  the  curve  must  be 
analytic  and  without  inflections. 

Q.E.D. 

Lemma  2  :  With  the  exception  of  straight-lines,  for 
planar  curves  whose  orthographic  projections  exhibit 
local  bilateral  symmetry  under  general  viewpoint, 
the  back-projections  of  the  axes  of  local  symmetry 
under  different  viewpoints  intersect  at  a  single 
point.  (This  point  of  intersection  may  be  at 
infinity,  i.e.  all  the  back-projections  may  be  paral¬ 
lel.) 

Proof  :  If  the  curve  is  a  pair  of  straight-lines  meet¬ 
ing  at  a  point,  then  this  lemma  is  immediately  seen 
to  be  true  with  all  the  back-projections  of  the  mir¬ 
ror  axes  passing  through  the  meeting  point.  If  this 
is  not  so,  then  it  follows  from  the  previous  lemma 
that  the  curve  is  analytic  and  without  inflections. 
Consider  Fig.  7,  once  again.  As  argued  above,  a 
necessary  and  sufficient  condition  for  the  projected 
curve to exhibit bilateral symmetry about the projection
of M for some viewpoint is that any line
parallel to the tangent, T, at (x,y) intersect (if at
all) the original curve at two points, $(x_1, y_1)$ and
$(x_2, y_2)$, whose midpoint lies on M. Now note that
under  perturbation  of  viewpoint  within  an  open  set 
on  the  Gaussian  sphere,  the  back-projection  of  the 
mirror  axis,  in  general,  cannot  continue  to  pass 
through  (x,y)  because  bilateral  symmetry  requires 
that  the  projections  of  M  and  T  intersect  at  right 
angles  over  and  above  the  condition  just  specified. 
The  only  exception  to  this  observation  is  the  degen¬ 
erate  case  of  a  straight-line  for  which  every  line 
through  every  point  is  the  back-projection  of  an 
axis  of  symmetry  for  some  viewpoint.  Disregarding 
straight-lines,  analyticity  about  (x,y)  implies  that  a 
perturbation  of  the  viewpoint  will  perturb  the 
intersection,  (x,y).  We  may  assume,  without  any 
loss  of  generality,  that  the  back-projection  of  the 
mirror  axis  under  some  different  viewpoint  passes 
through $(x_1, y_1)$. Call this back-projection $M_1$ and
let it intersect M at the origin O. If $M_1$ is parallel
to M, then O will be at infinity. By the same arguments
as those just presented, it is seen that any
line parallel to the tangent, $T_1$, at $(x_1, y_1)$ will
intersect (if at all) the original curve at two points,
say $p_1$ and $q_1$, such that their midpoint, $r_1$, lies on
$M_1$. Now notice that, given the segment of the
curve  above  M,  the  segment  below  M  is  completely 
determined.  We  may  imagine  it  to  be  generated  in 
the  following  way.  First,  project  the  upper  seg¬ 
ment  orthographically  in  the  direction  for  which 
the  projection  of  M  is  the  mirror  axis.  Then,  reflect 
this projection about the mirror axis and back-
project it onto the object plane. Let us now consider
that this procedure is applied not only to the
upper segment of the curve, but also to
$T_1$, $M_1$, $p_1$, $q_1$ and $r_1$. Let the corresponding lines
and points below M be subscripted by 2's. Then it
is not hard to see that $M_2$ will pass through $(x_2, y_2)$
and O. Also, $T_2$ will be tangent to the curve at
$(x_2, y_2)$. Further, $p_2 q_2$ will be parallel to $T_2$ and
$r_2$, the midpoint of $p_2 q_2$, will lie on $M_2$. Consequently,
$M_2$ is the back-projection of the mirror
axis through $(x_2, y_2)$. So, we have shown that three
back-projections of mirror axes, M, $M_1$ and $M_2$,
intersect at one point, O. As this argument holds
for all choices of points and lines in the region
above M, we can choose $(x_1, y_1)$ arbitrarily close to
(x,y). Then, we can proceed to apply the same
argument to the point which is the mirror image of
(x,y) about $M_1$ under orthographic projection, i.e.,
we make $M_1$ play the role previously played by M
and conclude that the back-projection of the mirror
axis through the new point will also pass through
O. Continuation of this argument leads to the
desired result.

Q.E.D. 

Proof  of  Theorem  :  Consider  Fig.  7,  once  again. 
We  consider  only  curves  other  than  straight-lines 
and  a  pair  of  straight-lines  meeting  at  a  point,  both 
of  which  exhibit  local  bilateral  symmetry  under 
orthographic  projection  with  a  general  viewpoint. 
Lemma  1  now  allows  us  to  restrict  our  attention  to 
analytic  curves  without  inflections  and  Lemma  2 
tells  us  that  the  back-projections  of  the  mirror  axes 
either  all  intersect  at  a  point  in  the  finite  plane  or 
that  they  are  all  parallel.  If  the  fixed  point  is  not 
at  infinity,  we  may  assume  without  any  loss  of  gen¬ 
erality  that  it  lies  at  the  origin.  As  argued  before, 
a  necessary  and  sufficient  condition  for  M  to  be  the 
back-projection  of  a  mirror  axis  is  that  any  line 
parallel  to  T,  the  tangent  at  (x,y),  intersect  (if  at 
all) the curve at two points, $(x_1, y_1)$ and $(x_2, y_2)$,
whose midpoint lies on M. We express this observation
in the following two equations. Equation (1)
is obtained by equating the slope of the tangent at (x,y)
to the slope of the line joining $(x_1, y_1)$ and $(x_2, y_2)$.
Equation (2) is obtained by equating the slope of the
back-projection of the mirror axis to that of the
line joining (x,y) with the midpoint of $(x_1, y_1)$ and
$(x_2, y_2)$. We call this slope s(x,y). It is a constant,
k, when the back-projections of the mirror axes are
parallel and y/x when they intersect at the origin.

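$$\frac{dy}{dx} = \frac{y_1 - y_2}{x_1 - x_2} \tag{1}$$

$$s(x,y) = \frac{(y_1 + y_2)/2 - y}{(x_1 + x_2)/2 - x} \tag{2}$$

If we let $\Delta_1 = x_1 - x$ and $\Delta_2 = x_2 - x$, then we
can rewrite the two equations as follows.

$$y_1 - y_2 = (\Delta_1 - \Delta_2)\,\frac{dy}{dx} \tag{3}$$

$$y_1 + y_2 = (\Delta_1 + \Delta_2)\,s(x,y) + 2y \tag{4}$$

Adding and subtracting (3) and (4) we get,

$$2y_1 = 2y + (\Delta_1 + \Delta_2)\,s(x,y) + (\Delta_1 - \Delta_2)\,\frac{dy}{dx} \tag{5}$$

$$2y_2 = 2y + (\Delta_1 + \Delta_2)\,s(x,y) - (\Delta_1 - \Delta_2)\,\frac{dy}{dx} \tag{6}$$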

We now write out the Taylor series expansion for
$y_1$ and $y_2$. This, of course, is under the assumption
that  y  is  a  function  of  x  in  the  neighborhood  of 
(x,y).  This  assumption  is  reasonable  because  were 
it  not  the  case,  we  could  always  rotate  the  axes 
without  any  loss  of  generality. 


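$$y_1 = y + \Delta_1 \frac{dy}{dx} + \frac{\Delta_1^2}{2}\frac{d^2y}{dx^2} + \frac{\Delta_1^3}{6}\frac{d^3y}{dx^3} + \text{higher order terms} \tag{7}$$

$$y_2 = y + \Delta_2 \frac{dy}{dx} + \frac{\Delta_2^2}{2}\frac{d^2y}{dx^2} + \frac{\Delta_2^3}{6}\frac{d^3y}{dx^3} + \text{higher order terms} \tag{8}$$

We equate the expressions for $y_1$ in equations (5)
and (7) to get equation (9) and the expressions for
$y_2$ in equations (6) and (8) to get equation (10).

$$s(x,y) - \frac{dy}{dx} = \frac{2}{\Delta_1 + \Delta_2}\left[\frac{\Delta_1^2}{2}\frac{d^2y}{dx^2} + \frac{\Delta_1^3}{6}\frac{d^3y}{dx^3} + \cdots\right] \tag{9}$$

$$s(x,y) - \frac{dy}{dx} = \frac{2}{\Delta_1 + \Delta_2}\left[\frac{\Delta_2^2}{2}\frac{d^2y}{dx^2} + \frac{\Delta_2^3}{6}\frac{d^3y}{dx^3} + \cdots\right] \tag{10}$$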


We now substitute for $\Delta_2$ in terms of $\Delta_1$ from
equation (9) into equation (10) and equate the
coefficients of the $\Delta_1^3$ terms on the two sides of the
resulting equation (the coefficients of the $\Delta_1$ and
$\Delta_1^2$ terms result in identities). Simplification leads
to the following third order non-linear ordinary
differential equation.

$$\frac{d^3y}{dx^3} - \frac{3\left(\dfrac{d^2y}{dx^2}\right)^2}{\dfrac{dy}{dx} - s(x,y)} = 0 \tag{11}$$


The  satisfaction  of  this  third  order  ordinary 
differential  equation  in  the  neighborhood  of  the 
intersection  of  the  back-projection  of  the  mirror 
line  and  the  original  curve  is  a  necessary  condition 
for  the  projected  curve  to  exhibit  bilateral  sym¬ 


metry  under  perturbation  of  the  viewpoint,  i.e. 
under  the  general  viewpoint  assumption. 
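
The step from (9) and (10) to the differential equation can be written out as follows. This is a reconstruction sketched under the assumption that (9) and (10) are the relations obtained by equating (5) with (7) and (6) with (8), each multiplied through by $\Delta_1 + \Delta_2$. The shorthand $y' = dy/dx$, $y'' = d^2y/dx^2$, $y''' = d^3y/dx^3$ and $s = s(x,y)$ is introduced here for brevity and is not the paper's notation.

```latex
\begin{align*}
(\Delta_1 + \Delta_2)(s - y') &= \Delta_1^2\, y'' + \tfrac{1}{3}\Delta_1^3\, y''' + O(\Delta_1^4)
  && \text{(9), from (5) and (7)} \\
(\Delta_1 + \Delta_2)(s - y') &= \Delta_2^2\, y'' + \tfrac{1}{3}\Delta_2^3\, y''' + O(\Delta_2^4)
  && \text{(10), from (6) and (8)} \\
\Delta_2 &= -\Delta_1 + \frac{y''}{s - y'}\,\Delta_1^2 + O(\Delta_1^3)
  && \text{solving (9) for } \Delta_2 \\
\Delta_2^2 &= \Delta_1^2 - \frac{2\,y''}{s - y'}\,\Delta_1^3 + O(\Delta_1^4),
  \qquad \Delta_2^3 = -\Delta_1^3 + O(\Delta_1^4) \\
\Delta_1^2\, y'' + \tfrac{1}{3}\Delta_1^3\, y'''
  &= \Delta_1^2\, y'' - \left(\frac{2\,(y'')^2}{s - y'} + \tfrac{1}{3}\, y'''\right)\Delta_1^3 + O(\Delta_1^4)
  && \text{equal right sides of (9) and (10)} \\
\tfrac{2}{3}\, y''' &= -\frac{2\,(y'')^2}{s - y'}
  \;\Longrightarrow\;
  y''' = \frac{3\,(y'')^2}{y' - s}
  && \text{coefficient of } \Delta_1^3
\end{align*}
```

The $\Delta_1^2$ coefficients agree identically, as stated in the text, and the last line is equation (11) solved for the third derivative, which is legitimate because $y' \neq s$.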

We  now  observe,  that  if  the  position,  tangent 
and  curvature  of  (x,  y)  completely  determine  any 
curve  which  also  satisfies  the  differential  equation, 
then  this  curve  must  be  the  unique  solution  to  the 
third  order  differential  equation  under  the  given 
conditions. This is a result from the theory of ordinary
differential equations (footnote 3). Let us assume that the
position, tangent and curvature of (x,y) are known.
The curvature, κ, may be positive or negative. It
may  not  be  zero  because  straight-lines  have  been 
excluded  from  this  discussion  and  inflections  are 
disallowed  by  Lemma  1.  We  examine  the  various 
solutions  to  the  differential  equation  for  both, 
s(x,y)  =  y/x  and  s(x,y)  =  k.  First,  we  make  two 
observations  about  restrictions  on  the  position  and 
tangent  of  (x,y).  One,  (x,y)  cannot  lie  at  the  fixed 
point.  It  is  not  hard  to  see  that  this  would  make 
equations (1) and (2) unsatisfiable under perturbation
of the viewpoint. Two, the tangent at (x,y)
cannot  lie  along  the  back-projection  of  the  mirror 
axis.  This  follows  from  the  requirement  that  the 
mirror  axis  be  orthogonal  to  the  tangent  at  (x,y)  in 
the  projected  plane. 

Case  s(x  ,y)  —  y  / x ,  k  >  0  .  Unique  ellipse  cen¬ 
tered  at  the  origin.  The  parameterization, 
x  =  a  cos(t  to)  and  y  =  b  sin(t  -t  0),  easily 
verifies  (11). 

Case  s(x  ,y  )  =  y  / x  ,  k  <  0  :  Unique  hyperbola 
centered  at  the  origin.  The  parameterization, 
x  =  a  cosh(t  -t0)  and  y  =  b  sin h( t  — t  0),  easily 
verifies  (11). 

Case  ?(z  ,y  )  =  k ,  k  0  :  L'nique  parabola  with 


5  We  miy  rewrite  the  third  order  differential  equation 
as  a  system  of  first  order  differential  equations, 

y '  =  f(x  .y  i.y».  y  *)  =  [y  *.  ya,  *y  >3/(y  j-»(x  y  i))l 

where  y  =  (y  |,  y  j,  y  si  =  |y  ,  y  '  ,  y"  |.  The  primes 
denote  differentiation  with  respect  to  x.  Now  note  that 
(dy  /dx  -  s(x  y  ))  7^  0  because  if  the  slope  of  the  tangent 
at  (x,y)  were  the  same  as  that  of  the  back-projection  of  the 
mirror  axis  through  (xjr),  then  their  projections  could  not 
be  at  right  angles  for  any  viewpoint.  It  easily  follows  that 
f  satisfies  the  bipschitt  condition  (see  [Coddington,  ltfilj) 
and  hence  the  initial  value  problem, 
y'  =  f(x,y)with  y(x0)  =  y0,  has  at  mos'  one  solu¬ 
tion  on  any  interval  containing  x0. 


964 


axis  haviDg  slope  k.  The  parameterization,  y  =  t 
and  x  =  t2  with  k  =  0,  easily  verifies  (11). 
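As a numerical sanity check on the three cases, the sketch below evaluates the third order equation in the form y''' = 3y''²/(y' - s(x,y)), which is our reading of the first order system in the footnote, on one sample curve from each family. The semi-axes, the evaluation points, the finite-difference step and all helper names are illustrative assumptions and not part of the proof.

```python
import math

def derivatives(y, x, h=1e-3):
    """Central-difference estimates of y', y'' and y''' at x."""
    y1 = (y(x + h) - y(x - h)) / (2 * h)
    y2 = (y(x + h) - 2 * y(x) + y(x - h)) / h ** 2
    y3 = (y(x + 2 * h) - 2 * y(x + h) + 2 * y(x - h) - y(x - 2 * h)) / (2 * h ** 3)
    return y1, y2, y3

def relative_residual(y, s, x):
    """Relative mismatch between y''' and 3 y''^2 / (y' - s(x, y)) at x."""
    y1, y2, y3 = derivatives(y, x)
    rhs = 3 * y2 ** 2 / (y1 - s(x, y(x)))
    return abs(y3 - rhs) / max(abs(y3), abs(rhs))

a, b = 2.0, 1.5  # illustrative semi-axes (assumption)

def ellipse(x):    # upper branch of x = a cos t, y = b sin t
    return b * math.sqrt(1 - (x / a) ** 2)

def hyperbola(x):  # upper branch of x = a cosh t, y = b sinh t
    return b * math.sqrt((x / a) ** 2 - 1)

def parabola(x):   # y = t, x = t^2, mirror-axis slope k = 0
    return math.sqrt(x)

def radial(x, y):      # s(x, y) = y / x
    return y / x

def horizontal(x, y):  # s(x, y) = k with k = 0
    return 0.0

residuals = {
    "ellipse": relative_residual(ellipse, radial, 1.0),
    "hyperbola": relative_residual(hyperbola, radial, 3.0),
    "parabola": relative_residual(parabola, horizontal, 1.0),
}
```

Each residual should be small compared with 1, limited only by the accuracy of the finite differences.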

Hence, we see that segments of conic-sections are the only non-zero
curvature solutions to the differential equation whose satisfaction in
the neighborhood of (x,y) is a necessary condition for local bilateral
symmetry to be exhibited by an analytic curve⁴. As an aside,
straight-lines are the zero curvature solutions to the equation. The
complete curve is uniquely specified by analytic extension. Now note
that orthographic projections of ellipses, parabolas and hyperbolas
always belong to the corresponding family⁵. Hence, conic-section
segments do project to curves exhibiting local bilateral symmetry. It
follows that, besides straight-lines and pairs of straight-lines
meeting at a point, segments of conic-sections are the only planar
curves which may exhibit local bilateral symmetry under orthographic
projection with a general viewpoint.

Q.E.D. 
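The claim that a conic stays in its own family under orthographic projection can be illustrated with exact arithmetic. The sketch below assumes that, for a general viewpoint, the projection acts on the plane of the curve as an invertible linear map plus a translation, so that only the quadratic coefficients (a, b, c) matter for the discriminant b² - 4ac. The sample conics, the sample map and the function names are hypothetical.

```python
from fractions import Fraction

def discriminant(a, b, c):
    """b^2 - 4ac of the quadratic part a x^2 + b xy + c y^2."""
    return b * b - 4 * a * c

def map_quadratic_part(a, b, c, inverse_map):
    """Coefficients after substituting x = p x' + q y', y = r x' + s y'."""
    (p, q), (r, s) = inverse_map
    a2 = a * p * p + b * p * r + c * r * r
    b2 = 2 * a * p * q + b * (p * s + q * r) + 2 * c * r * s
    c2 = a * q * q + b * q * s + c * s * s
    return a2, b2, c2

def sign(v):
    return (v > 0) - (v < 0)

conics = {
    "ellipse": (Fraction(1, 4), Fraction(0), Fraction(1)),     # x^2/4 + y^2 = 1
    "parabola": (Fraction(0), Fraction(0), Fraction(1)),       # y^2 = x
    "hyperbola": (Fraction(1), Fraction(0), Fraction(-1)),     # x^2 - y^2 = 1
}

# Inverse of an arbitrary invertible linear map (illustrative values).
inverse_map = ((Fraction(4, 5), Fraction(3, 10)),
               (Fraction(-1, 5), Fraction(1, 2)))
det = inverse_map[0][0] * inverse_map[1][1] - inverse_map[0][1] * inverse_map[1][0]

signs_before = {name: sign(discriminant(*q)) for name, q in conics.items()}
signs_after = {name: sign(discriminant(*map_quadratic_part(*q, inverse_map)))
               for name, q in conics.items()}
```

The discriminant is only rescaled by the squared determinant of the map, which is why its sign, and hence the family, is unchanged.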

⁴ Instead of formulating the third order differential equation we
could equivalently have presented a procedure which would result in a
unique numerical solution for the curve, given the position, tangent
and curvature at (x,y). Given this information at (x,y), one could
first predict, using the Taylor series, the positions and tangents at
the points with x-coordinates x_1 = x ± ε. Then, as s(·,·) determines
the back-projection of the mirror axis through every point on the
curve, we could predict the position and tangent at the
back-projections of the mirror images of (x,y) about the mirror axes
through (x_1,·). The mirror axes through these newly determined points
could now be used to further extend the curve, and so on. The errors
in the positions and tangents of (x_1,·) are proportional to ε³ and
ε², respectively, and the number of iterations needed to describe the
curve-segment is proportional to 1/ε. Hence, as we let ε → 0, the
numerical solution will converge to the true solution.

⁵ This can be verified for conic-sections by noting that the
discriminant, b² - 4ac, of the quadratic,
ax² + bxy + cy² + dx + ey + f = 0, does not change sign under
orthographic projection.

Acknowledgement

This work was inevitably influenced by Tom Binford, who introduced the
author to computer vision. Raz Stowe suggested that I formulate my
arguments towards the proof of the theorem in the Appendix using a
third order differential equation. Discussions with Hong-Seh Lim
prompted the generalization of Corollary 2 in the Appendix to the
Theorem. This work was supported in part by the Defense Advanced
Research Projects Agency under contract N00039-84-C-0211.

References 

Attneave, F., and Frost, R., “The determination of perceived
tridimensional orientation by minimum criteria,” Perception and
Psychophysics, Vol. 6 (6B), 1969, 391-396.

Coddington, E.A., An Introduction to Ordinary Differential Equations,
Prentice-Hall, New Jersey, 1961.

Coxeter, H.S.M., Introduction to Geometry, John Wiley & Sons, New
York, 1961.

Guillemin, V., and Pollack, A., Differential Topology, Prentice-Hall,
New Jersey, 1974.

Hilbert, D., and Cohn-Vossen, S., Geometry and the Imagination,
Chelsea, New York, 1962.

Kanade, T., “Recovery of the Three-Dimensional Shape of an Object
from a Single View,” Artificial Intelligence 17, 1981, 409-460.

Koenderink, J.J., and van Doorn, A.J., “The shape of smooth objects
and the way contours end,” Perception, Vol. 11, 1982, 129-137.

Koenderink, J.J., “What does the occluding contour tell us about
solid shape?,” Perception, Vol. 13, 1984, 321-330.

Lipschutz, M.M., Differential Geometry, McGraw-Hill, New York, 1969.

Lowe, D.G., and Binford, T.O., “The Recovery of Three-Dimensional
Structure from Image Curves,” IEEE Trans. PAMI-7, No. 3, May 1985,
320-326.

Malik, J., “Interpreting Line Drawings of Curved Objects,” Doctoral
Dissertation, Computer Science Department, Stanford, 1985.

Marr, D., “Analysis of Occluding Contour,” Proc. R. Soc. Lond. B.
197, 1977, 441-475.

Nalwa, V.S., “Line-Drawing Interpretation: Straight-Lines and
Conic-Sections,” Proc. Image Understanding Workshop, Los Angeles,
Feb. 1987.




Fig. 1. Line-Drawing with Various Types of Edges

Fig. 2-a. Crystals Exhibit a Symmetric Atomic Arrangement (The Sodium
Chloride Crystal Structure is Shown)

Fig. 2-b. Wallpaper Pattern Exhibiting Various Symmetries

Fig. 2-c. A Chair Exhibits Reflective Symmetry in Space

Fig. 3. Bilateral Symmetry in a Planar Figure Maps to Skewed Symmetry
under Orthographic Projection

Fig. 4. The Coordinate Frame

Fig. 5. A Surface of Revolution

Fig. 6. Examples of Bilaterally Symmetric Line-Drawings from which
Local Surfaces of Revolution may be Deduced (Cues: L - Line,
E - Ellipse, T - Tip)

Fig. 7. The Back-Projections of the Mirror Axes must Intersect at a
Single Point

Fig. 8. Bilateral Symmetry between Pairs of Inflections and
Discontinuities Maps to Skewed Symmetry between them under
Orthographic Projection




ESTIMATION AND ACCURATE
LOCALIZATION OF EDGES ¹

Fatih Ulupinar and Gerard Medioni

Institute  for  Robotics  and  Intelligent  Systems 
Departments  of  Electrical  Engineering  and  Computer  Science 
University  of  Southern  California 
Los  Angeles,  California  90089-0273 


Abstract 

The Laplacian of Gaussian (LoG) operator is one of the most popular
operators used in edge detection. This operator, however, has some
problems: not all zerocrossings correspond to edges, and zerocrossings
do not always accurately localize edges. In this paper, we offer
solutions to these two problems for one dimensional signals, such as
slices from images. For the problem of bias, we propose different
techniques: the first one combines the results of the convolution of
two LoG operators of different standard deviations, whereas the others
sample the convolution with a single LoG filter at two points besides
the zerocrossing. In addition to localization, these methods allow us
to further characterize the shape of the edge. We also discuss the
problem of edge interaction and offer a partial solution for it.
Results on synthetic and real images are presented.


1  Introduction 

Processing of visual information begins with the transformation of a
very large amount of viewer centered visual input into a set of
meaningful object centered descriptions. Physical edges are one of the
most important properties of an object. In the projection transform
onto a retina, these edges are very likely to create intensity
changes. The reverse is not true, as an intensity change can have its
source in a change of some property of an illuminated surface, such as
reflectance or surface marking. Therefore it is reasonable to assume
that the very first step in the analysis of an image is the detection
of intensity changes.

It can be argued that just the position of edges is a strong
indication of shape, as they allow understanding and recognition of
objects in cartoons. We believe that success at subsequent levels in a
vision system crucially depends on good features extracted from the
signal. These extracted features should have the properties of
completeness, that is they should capture all of the information
present in the intensity image, and of meaningfulness, that is they
should make the important object centered properties explicit.

¹ This research was supported by the Defense Advanced Research
Projects Agency under contract number F33615-84-K-1404, monitored by
the Air Force Wright Aeronautical Laboratories, DARPA Order No. 3119.

Since the work of Roberts [15] in the early 60's, a number of edge
detectors have been proposed. They can be classified in two broad
classes, depending on whether they use first or second derivative
properties.

1.1  First  Derivative  Edge  Detectors 

In this approach, edges can be detected as local maxima of the image
convolved with a first derivative operator. Roberts [15] and Sobel
[3,5] use a straightforward implementation of this idea.
Differentiation, however, is a noise enhancing operation, forcing
researchers to perform a smoothing step prior to differentiation, as
described in [5].

Recently,  Canny  [2]  derived  the  “optimal”  shape  of  such 
an  operator,  based  on  signal  to  noise  ratio  considerations;  it 
is  a  linear  combination  of  four  exponentials  which  is  closely 
approximated  by  the  first  derivative  of  a  Gaussian.  This 
edge  detector  performs  rather  well  on  real  images  and  has 
been  used  by  many  authors.  However,  the  biasing  problem 
that  we  discuss  in  this  paper  also  affects  this  operator  since 
it  uses  a  Gaussian  function  as  a  smoothing  filter. 

It  should  be  noted  that  all  detectors  using  first  derivative 
require  a  thinning  step,  since  an  edge  responds  with  a  broad 
peak. 

1.2  Second  Derivative  Edge  Detectors 

Here, edges are detected as the locations where the second derivative
of the (smoothed) image crosses zero. One of the most widely used
operators in this class was proposed by Marr and Hildreth [1], and, as
a result, the performance of most edge detectors is compared to it.
The basic idea is to convolve the image with a Laplacian of Gaussian
mask. The amount of smoothing is controlled by the variance of the
Gaussian. The main objection to the use of this operator has been that
it has poor localization properties and introduces a bias in the edge
location estimation, see [6,8,9,14].

Another operator of interest has been proposed by Haralick [8], where
edges are detected as the zerocrossings of the second directional
derivative in the direction of the gradient. Derivatives are computed
by fitting a facet model to the sampled intensity values. On some
images, the resulting edges are visually better than the ones from the
Marr-Hildreth detector. But this method is computationally more
expensive than the Laplacian of Gaussian operator. Torre and Poggio
[7] give an excellent analysis of the difference between the Laplacian
and the second derivative in the direction of the gradient.

Also  worth  noting  is  the  work  of  Nalwa  and  Binford  [6],  in 
which  a  Tanh  curve  is  fit  to  the  signal.  This  operator  is  not 
linear  and  does  not  lend  itself  to  a  multiscale  approach,  but 
claims  to  reduce  the  error  in  the  estimation  of  the  location 
of  an  edge. 

In  the  study  presented  here,  we  will  demonstrate  that  the 
bias  introduced  by  a  LoG  filter  can  be  corrected,  and  we 
will  show  different  methods  to  perform  this  correction,  with 
the  common  characteristic  that  they  all  perform  only  a  few 
local  operations. 

The next section presents an analysis of the zerocrossings obtained
after convolution with a LoG operator. We first show that some
zerocrossings do not correspond to edges, then express the bias in
position as a function of the parameters of an edge. Section 3
presents how this bias can be corrected using two different scales;
section 4 presents two different methods to correct the bias using a
single scale. In section 5, we present another source of error in edge
position and give a partial solution to it. Finally, we summarize our
results and outline future research.

2  Analysis  of  the  Zerocrossings  from 
a  LoG  Operator 

The Laplacian of Gaussian operator, introduced by Marr and Hildreth
[1], uses a Gaussian function to filter out the noise component of the
input signal and takes the Laplacian of the resulting smoothed signal.
In practice, both of the above processes are performed in one step by
convolving the input signal with a Laplacian of Gaussian, the LoG
operator. The LoG operator has many useful properties: the effect of
noise can be reduced by changing the standard deviation (σ) of the
operator, and the level of resolution of the operator can be changed
by manipulating σ. Many studies show that the Gaussian function is a
very good approximation for an optimal filter to be used in edge
detection, see [2,7]. In addition, the convolution can be performed
very efficiently, either by approximating the LoG by a DoG (Difference
of Gaussians) [1], or by other decompositions [10,11]. Despite these
useful properties, the LoG operator also has some problems:

-  Zerocrossings  do  NOT  always  correspond  to  actual 
edges,  but  to  inflection  points  instead. 

-  Zerocrossings  do  not  always  indicate  the  true  position 
of  an  edge. 

2.1  False  Response  of  LoG  Operator 

Second derivative operators generate a zerocrossing whenever the input
function has an inflection point. This inflection point generally
corresponds to a step edge in the input signal, but not always. Figure
1 shows such a case, in which the middle zerocrossing does not
correspond to any step edge. In fact, it results from an inflection
point with a tangent nearly parallel to the x-axis, and therefore has
a small first derivative. An easy solution for this problem would be
to apply a threshold to the value of the first derivative, as proposed
by Haralick [8]. Thresholding, however, is an ill posed problem, as
there is no guarantee of finding the right threshold for any input.

Here we propose a method which does not require any thresholding. We
define a step edge as an inflection point with a positive maximum or
negative minimum of the first derivative. This translates into a
simple test:

A zerocrossing corresponds to an edge iff

$$f' \cdot f''' < 0$$

Note that we do not really have to compute f''', but only sgn(f'''),
which is nothing but the polarity of the zerocrossing of the output of
the LoG operator. For the calculation of f' we have to convolve the
input curve with the first derivative of a Gaussian of the same σ as
used in the convolution with the second derivative of a Gaussian. Note
that we do not have to do this at every point, just at the location of
the zerocrossing.
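The test can be sketched for a sampled 1-D signal as follows. This is only an illustration under our own assumptions: the kernels are sampled Gaussian derivatives truncated at 4σ, borders are handled by replicating the end samples, the polarity of a zerocrossing is taken from the two samples that bracket it, and the staircase input (two steps separated by a short plateau) is a made-up stand-in for the situation of figure 1.

```python
import math

def derivative_kernels(sigma):
    """Sampled first and second derivatives of a unit-sum Gaussian."""
    radius = int(math.ceil(4 * sigma))
    offsets = list(range(-radius, radius + 1))
    gauss = [math.exp(-u * u / (2 * sigma ** 2)) for u in offsets]
    norm = sum(gauss)
    first = [-u / sigma ** 2 * g / norm for u, g in zip(offsets, gauss)]
    second = [(u * u / sigma ** 2 - 1) / sigma ** 2 * g / norm
              for u, g in zip(offsets, gauss)]
    return offsets, first, second

def convolve_at(signal, offsets, kernel, i):
    """(signal * kernel)(i), with the border samples replicated."""
    last = len(signal) - 1
    return sum(w * signal[min(max(i - u, 0), last)]
               for u, w in zip(offsets, kernel))

def classify_zerocrossings(signal, sigma):
    """Return (position, is_edge) for every sign change of the LoG output."""
    offsets, first, second = derivative_kernels(sigma)
    f2 = [convolve_at(signal, offsets, second, i) for i in range(len(signal))]
    results = []
    for i in range(len(signal) - 1):
        if f2[i] * f2[i + 1] < 0:                    # sign change between i and i+1
            f3_sign = 1 if f2[i + 1] > f2[i] else -1  # polarity of the crossing
            f1 = convolve_at(signal, offsets, first, i)
            results.append((i + 0.5, f1 * f3_sign < 0))
    return results

# Two step edges separated by a plateau (hypothetical test signal).
staircase = [0.0] * 40 + [10.0] * 20 + [20.0] * 60
crossings = classify_zerocrossings(staircase, sigma=3.0)
```

On this input the two outer zerocrossings are accepted as edges and the one on the plateau is rejected, without any threshold on the first derivative.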

2.2  Bias  Introduced  by  a  LoG  Operator 

In the literature it has been pointed out repeatedly, by [8,9] and
[14], that a LoG operator has poor localization as σ grows; in fact it
biases the original position of the edge. To analyze this bias, let us
convolve the operator with the following function:

$$f(x) = \begin{cases} k_1 x & \text{if } x < t \\ k_2 x + d & \text{otherwise} \end{cases} \tag{1}$$

This defines a step edge at location x = t; the height of the step is
d and the slopes on either side of the edge are k_1 and k_2. The
result of the convolution with a LoG operator is (note that in 1-D,
the LoG operator is nothing but the second derivative of the Gaussian
function):

$$(f * G'')(x) = \left[ \frac{d\,(x - t)}{\sigma^2} - (k_2 - k_1) \right] G(x - t) \tag{2}$$

where G(u) is the Gaussian function defined as (neglecting the
normalizing constant):

$$G(u) = e^{-u^2 / (2\sigma^2)} \tag{3}$$

The zerocrossing of equation 2 is located at

$$x_z = t + \frac{\sigma^2 (k_2 - k_1)}{d} \tag{4}$$

Therefore the bias is

$$x_z - t = \frac{\sigma^2 (k_2 - k_1)}{d} \tag{5}$$

This is shown in figure 2, which shows a step edge with different
slopes on each side together with the corresponding convolution with a
LoG operator. Note the displacement of the zerocrossing with respect
to the original location of the step edge.
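The predicted bias σ²(k_2 - k_1)/d can be checked numerically. The sketch below assumes a profile whose slopes are measured from the edge, k_1(x - t) on the left and k_2(x - t) + d on the right, so that the jump at x = t is exactly d; it approximates the convolution with the second derivative of a Gaussian by a midpoint sum truncated at 5σ and locates the zerocrossing by bisection. The parameter values, step size and function names are illustrative assumptions.

```python
import math

def edge_profile(x, t, k1, k2, d):
    """Step of height d at x = t, slope k1 on the left and k2 on the right."""
    return k1 * (x - t) if x < t else k2 * (x - t) + d

def log_response(x, sigma, profile, h=0.005):
    """Midpoint-rule approximation of (f * G'')(x), G'' truncated at 5 sigma."""
    n = int(round(5 * sigma / h))
    total = 0.0
    for j in range(-n, n):
        u = (j + 0.5) * h
        g2 = (u * u / sigma ** 2 - 1) / sigma ** 2 * math.exp(-u * u / (2 * sigma ** 2))
        total += profile(x - u) * g2 * h
    return total

def find_zerocrossing(f, lo, hi, iterations=30):
    """Bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

sigma, t, k1, k2, d = 3.0, 0.0, 0.2, 0.5, 10.0

def profile(x):
    return edge_profile(x, t, k1, k2, d)

measured = find_zerocrossing(lambda x: log_response(x, sigma, profile), -2.0, 2.0)
predicted = t + sigma ** 2 * (k2 - k1) / d   # equation 4
```

The zerocrossing location does not depend on the overall sign or normalization of the operator, so the check holds whichever sign convention is used for the LoG.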


We present two different methods to correct the bias introduced by the
LoG operator. In the first one, we perform the convolution at two
different resolutions and obtain t by solving equation 4. In the
second one, we use a single resolution and sample the convolved output
at more locations besides the zerocrossing. As we go along, we
illustrate our approach by showing the processing of two signals, one
synthetic and the other a slice from a real image. They are displayed
in figures 3 (real image) and 4 (synthetic). For these, (a) shows the
intensity image obtained by replicating a slice, (b) shows the
intensity variation along any of the rows of the image, and (c) shows
the displacement of the zerocrossing (-), negative extrema (- ·), and
positive extrema (- -) as a function of the filter standard deviation
σ.


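3 Bias Correction Using Multiresolution

In this method, the input signal is convolved with two operators of
different σ values to find the true location of the step edge detected
by a zerocrossing. Equation 4 can be rewritten as:

$$x_z = t + c\sigma^2 \tag{6}$$

where

$$c = \frac{k_2 - k_1}{d} \tag{7}$$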

In equation 6, there are two unknowns: t, the location of the step
edge, and c, the ratio of the slope difference across the edge to the
height of the edge. So, we need two independent equations, or two
samples of the same equation (equation 6), to solve for these
unknowns. The method of multiresolution uses the convolution outputs
of two LoG operators with different σ values, say σ_1 and σ_2. Then we
have two linear equations in t and c:

$$\begin{cases} x_{z_1} = t + c\sigma_1^2 \\ x_{z_2} = t + c\sigma_2^2 \end{cases} \tag{8}$$

and the straightforward solutions for t and c are:

$$c = \frac{x_{z_1} - x_{z_2}}{\sigma_1^2 - \sigma_2^2} \tag{9}$$

$$t = x_{z_1} + \sigma_1^2\,\frac{x_{z_2} - x_{z_1}}{\sigma_1^2 - \sigma_2^2} \tag{10}$$

With these we not only find the actual location of the step edge, t,
but can also predict the movement of a zerocrossing as σ changes,
since equation 6 is completely solved.
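The two-scale solution amounts to a few arithmetic operations per matched pair of zerocrossings. A minimal sketch, assuming the two zerocrossings have already been matched across scales and that the relation x_z = t + cσ² holds at both; the function names and the sample values are ours:

```python
def correct_two_scales(xz1, sigma1, xz2, sigma2):
    """Solve x_z = t + c * sigma^2 at two scales for the edge location t and c."""
    c = (xz1 - xz2) / (sigma1 ** 2 - sigma2 ** 2)
    t = xz1 - c * sigma1 ** 2
    return t, c

def predict_zerocrossing(t, c, sigma):
    """Zerocrossing position expected at another scale sigma."""
    return t + c * sigma ** 2

# Synthetic zerocrossings for an edge at t = 12.4 with c = 0.03 (made-up values).
xz_a = predict_zerocrossing(12.4, 0.03, 3.0)
xz_b = predict_zerocrossing(12.4, 0.03, 3.5)
t_est, c_est = correct_two_scales(xz_a, 3.0, xz_b, 3.5)
```

Evaluating `predict_zerocrossing(t_est, c_est, 0.0)` returns the corrected position, which is the extrapolation to σ = 0 described in the text.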


This method applied to the signals in figures 3 and 4 is shown in
figures 5 and 6. Figures 5-6(a) show the corrected position (σ = 0)
extrapolated from convolutions with two different size filters
(σ_1 = 3.0 and σ_2 = 3.5). Figures 5-6(b) show a further description
of each detected edge: position with subpixel accuracy, together with
the quantity c = (k_2 - k_1)/d.

This method has some problems:

First, we have to match the zerocrossings of two convolution outputs
with different σ values. As clearly seen in figure 3(c), zerocrossings
may disappear as σ grows. Thus, we may not have a one to one
correspondence of zerocrossings in the two convolution outputs. The
zerocrossings in the convolution with a small σ may correspond to one
or no zerocrossing in the convolution with a larger σ. We have to
solve a problem of correspondence, which requires more decision
making, and therefore more computation.

The second and more important problem is the interaction between
nearby edges. Since the LoG operator has a relatively large support
(approximately 7σ pixels), nearby features affect the convolution
output around the edge of interest, which leads to distorted or
incorrect results.


4 Bias Correction Using a Single Resolution

The second method to calculate the values t and c is to use just one
convolution output. There are two approaches to using a single
resolution. The first one uses the locations of the maximum, minimum
and zerocrossing to correct the bias. The second uses the convolution
output values at arbitrary locations around the zerocrossing.


4.1 Using Maximum and Minimum

This method utilizes the locations of the maximum, minimum and
zerocrossing associated with a step edge. Before going into the
details of the method, let us derive the equations for the locations
of the maximum and the minimum of the convolution output for the same
step edge model. The location of the maximum is:

$$x_p = \frac{c\sigma^2 + \sigma\sqrt{c^2\sigma^2 + 4}}{2} + t \tag{11}$$

and the location of the minimum is:

$$x_n = \frac{c\sigma^2 - \sigma\sqrt{c^2\sigma^2 + 4}}{2} + t \tag{12}$$

where c is given by equation 7. When we sum the locations of the
extrema (x_p and x_n) and subtract the location of the zerocrossing,
we get the true location of the step edge, t, a very elegant invariant
of the problem:

$$t = x_p + x_n - x_z \tag{13}$$

where x_z is the location of the zerocrossing in the convolution
output, and x_p, x_n are given above.
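The invariant can be exercised on the closed-form expressions. The sketch below assumes that the convolution output has the form [d(x - t)/σ² - (k_2 - k_1)]G(x - t), which is our reading of equation 2, and checks that the slope of this output vanishes at x_p and x_n before combining the three locations. All names and sample values are illustrative.

```python
import math

def extrema_and_zerocrossing(t, c, sigma):
    """Closed-form x_p, x_n (equations 11 and 12) and x_z (equation 6)."""
    root = sigma * math.sqrt(c * c * sigma ** 2 + 4)
    x_p = (c * sigma ** 2 + root) / 2 + t
    x_n = (c * sigma ** 2 - root) / 2 + t
    x_z = t + c * sigma ** 2
    return x_p, x_n, x_z

def response(x, t, d, dk, sigma):
    """Assumed LoG output for a step of height d and slope difference dk."""
    u = x - t
    return (d * u / sigma ** 2 - dk) * math.exp(-u * u / (2 * sigma ** 2))

def edge_location(x_p, x_n, x_z):
    """Equation 13."""
    return x_p + x_n - x_z

t_true, d, dk, sigma = 5.0, 10.0, 0.3, 3.0
x_p, x_n, x_z = extrema_and_zerocrossing(t_true, dk / d, sigma)

h = 1e-4  # finite-difference step for the slope check
slope_at_p = (response(x_p + h, t_true, d, dk, sigma)
              - response(x_p - h, t_true, d, dk, sigma)) / (2 * h)
slope_at_n = (response(x_n + h, t_true, d, dk, sigma)
              - response(x_n - h, t_true, d, dk, sigma)) / (2 * h)
t_est = edge_location(x_p, x_n, x_z)
```

In practice x_p, x_n and x_z would be measured on the convolution output rather than computed from known parameters; the closed forms are used here only to keep the check self-contained.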

The location of the step edge is not the only information that we can
get out of these formulas. We can also obtain the height of the step
edge, d, and the slope difference across the step edge, (k_2 - k_1).
The equation for the height of the step edge can be obtained from
equation 2:

$$d = \frac{m\sigma^2}{l}\, e^{(x_z - t + l)^2 / (2\sigma^2)} \tag{14}$$

where m = (f * G'')(x_z + l). That is, m is the value of the
convolution at location x_z + l for some arbitrary l. Also, the slope
difference can be obtained from equations 6 and 7:

$$(k_2 - k_1) = \frac{(x_z - t)\, d}{\sigma^2} \tag{15}$$

As presented here, we not only get the actual location of the step
edges with subpixel accuracy, but we can also get much more
information about the step edge, namely the height and the slope
difference across it, from the convolution with a LoG filter. In
particular, the true height of the edge can be used in thresholding
the output of the operator, since it is usually a problem to find a
natural thresholding criterion in second derivative operators.

In figures 7 and 8 the result of this method applied to the signals in
figures 3 and 4 is presented. With this method, as in the previous
one, we can also predict the movement of the zerocrossing and the
extrema of the convolution output as σ changes. Figures 7-8(a) show
the predicted position of the extrema and zerocrossing based on their
actual position in the convolution with a LoG filter of standard
deviation σ (here σ = 3.0). Figures 7-8(b) list each edge location
together with its estimated step size (d), slope difference
(k_2 - k_1), and the amount of the correction in localization
(x_z - t).
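Once t is known, the height and slope difference follow from one extra sample of the convolution output. The sketch below implements the height and slope formulas as we read them, d = (mσ²/l)·exp((x_z - t + l)²/(2σ²)) and k_2 - k_1 = (x_z - t)d/σ², and tests them on a sample generated from the assumed response [d(x - t)/σ² - (k_2 - k_1)]G(x - t). The sign convention, the names and the numbers are assumptions made for the illustration.

```python
import math

def step_height(m, l, x_z, t, sigma):
    """Height d from one sample m of the output taken at x_z + l."""
    return m * sigma ** 2 / l * math.exp((x_z - t + l) ** 2 / (2 * sigma ** 2))

def slope_difference(d, x_z, t, sigma):
    """k2 - k1 from the height and the measured bias x_z - t."""
    return (x_z - t) * d / sigma ** 2

# Synthetic sample generated from the assumed response model.
sigma, t, d_true, dk_true, l = 3.0, 5.0, 10.0, 0.3, 1.5
x_z = t + sigma ** 2 * dk_true / d_true
u = x_z - t + l
m = (d_true * u / sigma ** 2 - dk_true) * math.exp(-u * u / (2 * sigma ** 2))

d_est = step_height(m, l, x_z, t, sigma)
dk_est = slope_difference(d_est, x_z, t, sigma)
```

If the operator is implemented with the opposite sign, m changes sign and so does the recovered d, so the convention has to be fixed before the height is used for thresholding.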


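4.2 Second Approach to Single Resolution

Further analysis of the equations shows that we do not need to find
the location of the maximum and the minimum in the convolution output
to find t. Suppose we sample the convolution output at two different
locations, say at x_z + l_1 and at x_z + l_2, and the values obtained
at these samples are m_1 and m_2, that is

$$m_1 = \left[ \frac{d\,(x_z - t + l_1)}{\sigma^2} - (k_2 - k_1) \right] G(x_z - t + l_1) \tag{16}$$

$$m_2 = \left[ \frac{d\,(x_z - t + l_2)}{\sigma^2} - (k_2 - k_1) \right] G(x_z - t + l_2) \tag{17}$$

The ratio of these two expressions gives us:

$$\frac{m_1}{m_2} = \left[ \frac{x_z - t + l_1 - c\sigma^2}{x_z - t + l_2 - c\sigma^2} \right] \exp\!\left( \frac{-(x_z - t + l_1)^2 + (x_z - t + l_2)^2}{2\sigma^2} \right) \tag{18}$$

where c is given in equation 7. Also note that cσ² is equal to
x_z - t, as given in equation 6. With this substitution we get:

$$\frac{m_1}{m_2} = \frac{l_1}{l_2} \exp\!\left( \frac{-(x_z - t + l_1)^2 + (x_z - t + l_2)^2}{2\sigma^2} \right) \tag{19}$$

This equation is analytically solvable in t:

$$t = x_z - \frac{\sigma^2}{l_2 - l_1}\, \ln\!\left( \frac{m_1 l_2}{m_2 l_1} \right) + \frac{l_1 + l_2}{2} \tag{20}$$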




In the above equation, if we set l_2 = -l_1, we get a much simplified
equation, but when working with discrete data one will need to
interpolate the values m_1 and m_2 with this restriction on the l
values. The simplified equation is:

$$t = x_z + \frac{\sigma^2}{2 l_1}\, \ln\!\left( -\frac{m_1}{m_2} \right) \tag{21}$$


For choosing the values l_1 and l_2 there are two considerations.
First, the smaller these values are, the closer the equation is to
being singular and ill posed, and the poorer the accuracy. Second, the
larger these values are, the more severe the problem of interaction,
which is discussed in the next section. It is a trade off between
these two. In the ideal case, where there is no distortion and no
interaction, all values should produce the same result.
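The two-sample correction can be sketched as below. It assumes the output near the edge follows [d(x - t)/σ² - (k_2 - k_1)]G(x - t), so that the ratio of two samples at offsets l_1 and l_2 from the zerocrossing gives t = x_z - σ²/(l_2 - l_1) · ln(m_1 l_2/(m_2 l_1)) + (l_1 + l_2)/2. The synthetic samples, names and offsets are illustrative; with real data m_1 and m_2 would be interpolated from the convolution output.

```python
import math

def edge_location_from_samples(x_z, m1, l1, m2, l2, sigma):
    """Edge location t from two samples of the LoG output around x_z."""
    log_ratio = math.log((m1 * l2) / (m2 * l1))
    return x_z - sigma ** 2 / (l2 - l1) * log_ratio + (l1 + l2) / 2

def response(x, t, d, dk, sigma):
    """Assumed LoG output for a step of height d and slope difference dk."""
    u = x - t
    return (d * u / sigma ** 2 - dk) * math.exp(-u * u / (2 * sigma ** 2))

sigma, t_true, d, dk = 3.0, 5.0, 10.0, 0.3
x_z = t_true + sigma ** 2 * dk / d

# Symmetric offsets (l2 = -l1) and an arbitrary asymmetric pair.
estimates = []
for l1, l2 in [(1.5, -1.5), (1.5, -1.0), (0.5, 2.0)]:
    m1 = response(x_z + l1, t_true, d, dk, sigma)
    m2 = response(x_z + l2, t_true, d, dk, sigma)
    estimates.append(edge_location_from_samples(x_z, m1, l1, m2, l2, sigma))
```

In this noise-free setting every pair of offsets returns the same t, which is the ideal case mentioned above; the ratio inside the logarithm is independent of the overall sign of the operator.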

This method applied to the signals in figures 3 and 4 is shown in
figures 9 and 10. Figures 9-10(a) show the predicted positions of the
extrema and zerocrossing based on their actual position in the
convolution with a LoG filter of standard deviation σ (here σ = 3.0
and l_1 = 1.5). Figures 9-10(b) list each edge location together with
its estimated step size and slope difference, and the amount of the
correction in localization.

4.3  Problems  with  the  Model 


Both of the above methods, the multiresolution and single resolution
methods, work perfectly unless we have the problem of interference
between nearby features, which displaces the locations of the extrema
and zerocrossings and thus degrades the accuracy of the calculations.
The multiresolution method seems to work better in the case of the
interaction of nearby features; however, both methods are affected.
The movement of a zerocrossing with respect to σ can result either
from k_2 ≠ k_1 for that edge or from the interference of nearby
features. The multiresolution method samples the actual movement of
zerocrossings with respect to σ, and therefore it also partially
corrects the effect of interaction, whereas the single resolution
method works on one sample and tries to solve the problem without
knowing the actual movement of the zerocrossing. One should note,
however, that it reduces noticeably the bias due to interaction in a
bump, which is studied in the next section.


5 Solving the Interaction Problem

The problem of interaction of nearby features is also studied. The
first solution proposes to use a model which consists of two features,
in other words, a model which expresses the relation of interaction
between two features. The subsequent sections discuss such a model,
which we call a “bump” model.


5.1  The  Bump  Model 


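To solve the interaction problem, a new model, which we call a bump,
is used, which has the following functional representation:

$$B(x) = \begin{cases} d & \text{if } -a < x < a \\ 0 & \text{otherwise} \end{cases} \tag{22}$$

and the convolution output for this model is

$$(B * G'')(x) = \frac{1}{\sigma^2} \left( -(dx - ad)\, e^{2ax/\sigma^2} + dx + ad \right) e^{-(x^2 + 2ax + a^2)/(2\sigma^2)} \tag{23}$$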

In figure 11, a typical bump and the resulting convolution output are
shown. The locations of the zerocrossings are:

$$x_z = \frac{a\, e^{2 a x_z/\sigma^2} + a}{e^{2 a x_z/\sigma^2} - 1} \tag{24}$$

Note that the solution for the zerocrossings given above is not an
explicit solution in terms of x_z. Unfortunately, there is no explicit
analytical solution for that formula, so it is solved using the
Newton-Raphson method. From that equation, we can find the value of a
(half of the width of the bump), and the location of the bump is just
in the middle of the two zerocrossings that we get from the
convolution. In addition, we are able to find the height of the bump
by the following equation:

$$d = \frac{m\sigma^2\, e^{(x_m^2 + 2 a x_m + a^2)/(2\sigma^2)}}{-(x_m - a)\, e^{2 a x_m/\sigma^2} + x_m + a} \tag{25}$$

where x_m is the location of the local extremum between the
zerocrossings and m is the magnitude of the convolution at location
x_m.
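The Newton-Raphson step can be sketched as follows. We use the fact that the implicit zerocrossing relation x_z = a(e^{2a x_z/σ²} + 1)/(e^{2a x_z/σ²} - 1) is algebraically the same as a = x_z tanh(a x_z/σ²), and solve the latter for a given the measured half-separation x_z of the two zerocrossings. The starting point a = x_z, the iteration limits and the bisection-based forward model used for the demonstration are our own choices, not taken from the paper.

```python
import math

def bump_half_width(x_z, sigma, iterations=50, tol=1e-12):
    """Solve a = x_z * tanh(a * x_z / sigma^2) for a > 0 by Newton-Raphson."""
    if x_z <= sigma:
        raise ValueError("a nonzero solution requires x_z > sigma")
    a = x_z  # the root lies below x_z; Newton steps move down towards it
    for _ in range(iterations):
        q = a * x_z / sigma ** 2
        value = a - x_z * math.tanh(q)
        slope = 1 - (x_z / sigma) ** 2 / math.cosh(q) ** 2
        step = value / slope
        a -= step
        if abs(step) < tol:
            break
    return a

def zerocrossing_offset(a, sigma, iterations=80):
    """Forward model for the demo: solve x * tanh(a x / sigma^2) = a by bisection."""
    lo, hi = max(a, sigma), a + 10 * sigma
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.tanh(a * mid / sigma ** 2) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma = 3.0
recovered = {}
for a_true in (2.0, 4.0):
    x_z = zerocrossing_offset(a_true, sigma)
    recovered[a_true] = bump_half_width(x_z, sigma)
```

Note that the zerocrossings of a narrow bump always lie farther than σ from its centre, which is why the solver rejects x_z ≤ σ.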


This bump model is in fact not a solution to the problem of
interaction of nearby features, as it creates more problems than it
solves. First, it is a very inflexible model, as it restricts the two
sides of the bump to be equal, which is rarely the case in real
images. The second problem is that this model predicts two edge
locations for each zerocrossing if we have two bumps close to each
other. The third problem is that it only considers the interaction of
two step edges, whereas in the real world we have interaction of more
than two edges (practically, all edges closer than the distance 3σ).
The first problem can be solved by using a more flexible model such
as:

$$B(x) = \begin{cases} d_2 & \text{if } t - a < x < t + a \\ d_1 + d_2 & \text{if } x > t + a \\ 0 & \text{otherwise} \end{cases} \tag{26}$$

This is a bump with different step heights on each side, d_2 on the
left side and d_1 on the right side. The convolution output for this
model is:

$$(B * G'')(x) = \frac{1}{\sigma^2} \left( d_1((x - t) - a)\, e^{2a(x - t)/\sigma^2} + d_2((x - t) + a) \right) e^{-((x - t)^2 + 2a(x - t) + a^2)/(2\sigma^2)} \tag{27}$$

And the equation for the zerocrossing is

$$x_z = t + a\, \frac{d_1 e^{2a(x_z - t)/\sigma^2} - d_2}{d_1 e^{2a(x_z - t)/\sigma^2} + d_2} \tag{28}$$


This is not an explicit solution either. In fact, it is not possible
to solve this equation in this form since there are too many unknowns
(a, d_1, d_2 and t, which is the location of the middle of the bump).
Therefore applying this model is no good either. Consequently, the
bump model does not give any help in solving the problem of
interaction. We believe that the problem of edge interaction is solved
by Chen and Medioni [11].


6  Conclusion  and  Future  Work 

Here  we  have  presented  a  method  to  refine  the  position  of 
edges  after  the  detection  of  the  zero  crossings  of  the  original 
signal  convolved  with  a  LoG  operator.  We  also  estimate  the 
shape  of  each  edge,  including  the  height  and  the  difference 
of  slopes  on  each  side. 

As stated in section 4.3, the main disadvantage of this
technique is that it does not solve the edge interaction
problem, which is studied and solved in [11]. That study,
however, does not deal with the bias introduced by the slope
differences of the step edge. We believe that a combination
of these two methods will produce much better results.

The second step is the extension to 2-D. One solution is
to perform the convolution along the rows and columns of the
image and add the results. Theoretically, knowing the
derivative of a 2-D surface in two independent directions is
sufficient to know the derivative in all directions. This is
true when noise is neglected. Still, this method is quite
sufficient for finding edges in an image. A true application
to 2-D, however, requires the equations to be modeled and
solved in 2-D, and these equations may in practice be too
difficult to solve in the way they are solved for 1-D slices.
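The row-and-column scheme mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the 1-D LoG kernel is sampled out to 4σ and its mean is subtracted so that constant regions give zero response. The function names `log_kernel_1d` and `rows_plus_columns` are hypothetical, and none of these choices are taken from the paper.

```python
import numpy as np

def log_kernel_1d(sigma):
    """Sampled second derivative of a 1-D Gaussian (assumed truncation at 4*sigma)."""
    r = int(np.ceil(4 * sigma))
    x = np.arange(-r, r + 1, dtype=float)
    k = (x ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * np.exp(-x ** 2 / (2 * sigma ** 2))
    # Assumption: remove the small residual mean so flat regions respond with zero.
    return k - k.mean()

def rows_plus_columns(img, sigma):
    """Convolve every row and every column with the 1-D kernel and add the results."""
    k = log_kernel_1d(sigma)
    conv = lambda v: np.convolve(v, k, mode="same")
    rows = np.apply_along_axis(conv, 1, img)   # along each row
    cols = np.apply_along_axis(conv, 0, img)   # along each column
    return rows + cols

# Synthetic vertical step edge between columns 31 and 32.
img = np.zeros((64, 64))
img[:, 32:] = 1.0
response = rows_plus_columns(img, sigma=2.0)
```

Away from the image border, the summed response changes sign across the step, which is where a zerocrossing detector would place the edge.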


References 

[1] Marr, D. C. and Hildreth, E. C., “Theory of Edge Detection,”
Proceedings of the Royal Society of London, B207, 1980,
pp 187-217.

[2] Canny, J. F., “Finding Edges and Lines in Images,”
Technical Report 720, MIT AI Lab., June 1983.

[3] Nevatia, R., “Machine Perception,” Prentice-Hall, 1982.

[4] Nevatia, R. and Babu, K. R., “Linear Feature Extraction
and Description,” Computer Graphics and Image Processing,
Vol. 13, 1980, pp 257-269.

[5] Rosenfeld, A. and Kak, A. C., “Digital Picture Processing,”
Academic Press, 1982, Vol. 2.

[6] Nalwa, V. S. and Binford, T. O., “On Detecting Edges,”
Proceedings of Image Understanding Workshop, Miami,
December 1985, pp 450-465.

[7] Torre, V. and Poggio, T. A., “On Edge Detection,”
IEEE Trans. on Pattern Analysis and Machine Intelligence,
Vol. 8, No. 2, March 1986, pp 147-163.

[8] Haralick, R. M., “Digital Step Edges from Zerocrossings
of Second Directional Derivatives,” IEEE Trans. on Pattern
Analysis and Machine Intelligence, Vol. 6, No. 1,
January 1984, pp 58-68.

[9] Berzins, V., “Accuracy of Laplacian Edge Detectors,”
Computer Vision, Graphics and Image Processing, Vol. 27,
1984, pp 195-210.

[10] Huertas, A. and Medioni, G., “Detection of Intensity
Changes with Subpixel Accuracy Using Laplacian-Gaussian
Mask,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, Vol. 8, No. 5, September 1986, pp 651-664.

[11] Chen, J. S. and Medioni, G., “Detection, Localization
and Estimation of Edges,” Intelligent Systems Group, USC,
Internal Report, 1986, to be published.

[12] Chen, J. S., Huertas, A. and Medioni, G., “Very Fast
Convolution with Laplacian-of-Gaussian Mask,” Proceedings
CVPR-86, Miami, June 1986, pp 293-298.

[13] Watson, L., et al., “Topographic Classification of
Digital Image Intensity Surfaces Using Generalized Spline
and the Discrete Cosine Transformation,” Computer Vision,
Graphics and Image Processing, Vol. 29, 1985, pp 143-167.

[14] Shah, M., Sood, A. and Jain, R., “Pulse and Staircase
Models for Detecting Edges at Multiple Resolution,”
Proceedings of the Third Workshop on Computer Vision:
Representation and Control, Bellaire, Michigan,
October 1985, pp 84-95.

[15] Roberts, L. G., “Machine Perception of Three Dimensional
Solids,” Optical and Electro Optical Information Processing,
ed. Tippett, J. T., et al., Cambridge, Mass., MIT Press,
1965, pp 150-197.


Figure 1: Zerocrossings do not always correspond to edges:
a staircase function and its convolution with a LoG mask (σ = 2.0).


Figure 2: Zerocrossings do not accurately locate edges: a step function
(k1 = 1.0, k2 = 0.5, d = 10) and its convolution with a LoG mask
(σ = 2.0).


Figure 4: (a) Intensity image obtained by replicating a synthetic 1-D
signal, (b) The signal used to produce the image in (a), (c) The
displacement of zerocrossing (-), negative extrema (- .), and positive
extrema (- -) in the convolution with LoG mask as σ changes from 1.0 to 8.5


feature no   location    ratio
    1          20.94     -0.14
    2          24.48      0.16
    3          43.02      0.05
    4          72.03     -0.05
    5         110.11     -0.06
    6         114.10      0.32
    7         127.92     -0.30
    8         137.58     -0.02

(b)

Figure 5: Application of the multiresolution method to the real signal
in figure 3 with σ1 = 3.0 and σ2 = 3.5.
(a) The predicted σ-x space, (b) the listing of features.


feature no   location    ratio
    1          29.46      0.06
    2          72.42      0.01
    3          84.50     -0.01
    4         124.52      0.13
    5         165.51     -0.04

(b)

Figure 6: Application of the multiresolution method to the synthetic
signal in figure 4 with σ1 = 3.0 and σ2 = 3.5.
(a) The predicted σ-x space, (b) the listing of features.




feature no   location   (k2-k1)       d      correction
    1          20.52     -2.13      22.25       0.86
    2          25.13     -1.76     -20.59      -0.77
    3          43.12      1.60      40.05      -0.36
    4          71.92      1.52     -41.40       0.33
    5         109.95     -4.29     101.86       0.38
    6         114.70    -12.86     -51.46      -2.25
    7         126.28      0.86      -7.37       1.05
    8         137.59      0.40     -22.45       0.16

(b)

Figure 7: The single resolution method using maxima and minima applied
to the signal in figure 3 with σ = 3.0.
(a) The predicted σ-x space, (b) the listing of features, showing the
height of the step edge and location correction.


feature no   location   (k2-k1)       d      correction
    1          29.53      4.94      92.94      -0.48
    2          72.32     -1.67     -80.06      -0.19
    3          84.68      1.91     -91.37       0.19
    4         124.50      4.69      35.00      -1.21
    5         165.48      3.02     -86.23       0.32

(b)

Figure 8: The single resolution method using maxima and minima applied
to the signal in figure 4 with σ = 3.0.
(a) The predicted σ-x space, (b) the listing of features, showing the
height of the step edge and location correction.


The bump model and its convolution with a LoG mask for




Edge-Detector  Resolution  Improvement  by  Image  Interpolation 


Vishvjit  S.  Nalwa 

A. I.  Lab.,  Stanford  University,  CA  94305 


Abstract 

Most step-edge detectors are designed to detect
locally straight edge-segments which can be isolated
within the operator kernel. While it can easily be
demonstrated that a cross-sectional support of at least
4 pixels is required for the unambiguous detection of a
step-edge, edges which cannot be isolated within windows
having this width can nevertheless be resolved.
This is achieved by preceding the step-edge detection
process by image-intensity interpolation. Although
resolution can be improved in this fashion, the step-edge
position and intensity estimates thus determined
may be subject to systematic biases. Also, the higher
resolution performance is accompanied by lower
robustness to noise.


I.  Introduction 

Most step-edge detectors are designed to detect
locally straight edge-segments which can be isolated
within the operator kernel (e.g. [8], [5], [9]). It was
pointed out in [9] that the minimum lateral support
required for the unambiguous detection of a step-edge
was 4 pixels. The underlying implication was that
step-edges for which such a support was unavailable
would require special-purpose detectors, e.g. line-detectors.
It is demonstrated here that this is often
unnecessary. Contrary to common belief (see [8], [4],
[5], [9]), so-called line-edges can be detected by step-edge
detectors if we first reconstruct the underlying
image-intensity surface.

Line-edges have often been mentioned in the literature
(see [4]) without, to the best of our knowledge, it
ever being explicitly stated when two locally parallel
straight step-edges constitute a line-edge, i.e. at what
separation between two step-edges do we consider
them to constitute a single line-edge, and why? The
answer, we claim, is system dependent. As discussed
at length in [9], edges undergo “blurring” when
registered by any imaging device¹. Whenever this
“blur” is sufficient to cause the images of two parallel

This paper has been accepted for publication in IEEE Transactions
on Pattern Analysis and Machine Intelligence.

This work was supported in part by the Defense Advanced
Research Projects Agency under contract N00039-84-C-0211.


step-edges to interact, we have a line-edge. This is
illustrated in Fig. 1 for the 1-D continuous case. If we
assume the blurring function to be well-approximated
by a Gaussian, as is often done for simplicity, then we
can consider pairs of step-edges separated by less than
5σ_blur to constitute line-edges. (Typically, σ_blur ≥
0.5 pixel for imaging systems with little aliasing.)

In  this  paper  we  seek  to  demonstrate  how  so- 
called  line-edges  may  be  detected  by  a  step-edge 
detector.  This  is  achieved  by  reconstructing  the 
underlying  image-intensity  surface.  Line-edges  need 
to  be  discovered  nevertheless,  not  because  it  is  not 
possible  to  detect  them  as  a  pair  of  step-edges,  but 
because  we  want  to  accurately  recover  the  parameters 
of  the  underlying  steps.  As  is  evident  from  Fig.  1,  the 
interaction  between  the  images  of  the  component 


[Figure panels: unblurred and blurred profiles for non-interacting
step-edges and for interacting step-edges (a line-edge).]

Fig. 1. Non-Interacting and Interacting Step-Edges


¹ There are two fundamental sources for the “blur” associated
with any imaging system. One is the diffraction limit (see [6])
and the other is the finite sensor aperture. Optical aberrations
and defocus may also contribute (see [6]).




Fig. 2. Ambiguity in profile, given 3 symmetric
samples of a step-edge cross-section.


step-edges will result in systematic errors in their position
and intensity estimates; the discovery of line-edges
is needed to remove these biases. (In general,
non-zero slopes on the two sides of step-edges will also
bias the parameter estimates.) We concern ourselves
here with the detection aspect and do not address in
any detail the issues pertaining to the systematic
errors.

As an aside, we mention that it has been speculated
in the psychophysical literature that explicit
image interpolation is performed by the human visual
system. For instance, Barlow [3] has suggested that
discrete spatial interpolation between retinal samples
may be performed at the terminations of the optic
radiation fibers in the cortex.

In  Section  II  the  underlying  procedure  is  outlined 
and  in  Section  III  the  design  of  the  interpolation  filter 
is  discussed.  Section  IV  presents  some  examples  with 
which  we  hope  to  highlight  the  methodology  and 
demonstrate  its  performance  on  "real"  images.  We 
conclude  in  Section  V  with  a  discussion. 


II.  The  Procedure 

In [9] it was illustrated that 3 symmetric samples
were insufficient to detect a 1-D step-edge owing to
inherent ambiguities. This is depicted in Fig. 2. It is
clear, however, that if we had available the continuous
signal itself, then our window size would not be
thus restricted, i.e. a 3 pixel window on the continuous
signal has no such ambiguity. But, if our signal is
by and large bandlimited, then we can use an interpolation
filter to reconstruct the underlying continuous
signal. This is a reasonable assumption if σ_blur, for
the effective Gaussian blurring function associated
with the imaging system, is no less than 0.5 pixel.
Hence, although we do require more than 3 samples to
detect a step-edge, it is not necessary that we be able
to isolate the step-edge within such a window. We
can instead use the samples to reconstruct the continuous
signal and then detect step-edges using

Fig. 3. Procedure for detecting step-edges using
interpolation.

smaller windows. This procedure is illustrated in Fig.
3 for the samples of a line-edge. All the foregoing
arguments carry over to 2-D. In practice, we need not
reconstruct the continuous signal. Interpolation onto
a finer grid often suffices. For purposes of demonstration,
we chose our new grid to have half the original
inter-pixel spacing.

III.  Design  of  Interpolation  Filter 

The choice for an interpolation filter which
immediately comes to mind for a square sampling grid
is sinc(x) sinc(y), where x and y are the row and
column axes and sinc(x) = sin(πx)/(πx) (see [2]). The
sinc function, however, is of infinite extent and therefore
practically unfeasible. We could window the
filter. But then we would be confronted with the
choice of the windowing function with its accompanying
trade-offs (see [10]). An alternative procedure is to
design a filter which has certain desirable characteristics.

Taking the cue from the sinc filter, we can make
two immediate observations. First, the 2-D filter can
be conceived of as the separable product of two identical
1-D filters. Hence, we need only design a 1-D
filter. Second, in order to reconstruct the signal while
preserving the given samples we must have unit magnitude
at the origin of the filter and zeros at integer-pixel
distances from the origin. As mentioned previously,
for our purposes it suffices to interpolate the
intensities onto a grid with half the original inter-pixel
spacing. Hence, we need not design a continuous



   1     0    -9   -16    -9     0     1
   0     0     0     0     0     0     0
  -9     0    81   144    81     0    -9
 -16     0   144   256   144     0   -16
  -9     0    81   144    81     0    -9
   0     0     0     0     0     0     0
   1     0    -9   -16    -9     0     1

Fig. 4. Interpolation filter designed to
preserve functions belonging to the space
spanned by {1, x, x², x³} ⊗ {1, y, y², y³}. The
sampling grid for this filter has a half-pixel
spacing.


filter. It is enough to have its samples at half-pixel
spacings. Thus, if we decide on a 1-D filter with a 4
pixel support, its coefficients will be {0 a 0 b 1 c 0 d
0}, where the coefficients are listed at half-pixel spacings
about the origin. The unspecified coefficients can
be systematically chosen by requiring that functions
spanned by a given basis be reconstructed without
distortion. We chose {1, x, x², x³} to be that basis.
This is the lowest order polynomial basis that can
represent inflection points, which are exhibited by
every step-edge. It is clear that if we choose the
coefficients to be symmetric about the center (c = b,
d = a), all odd powers of x would be undistorted at the
origin. If we further require 2(a + b) = 1 and
2((9/4)a + (1/4)b) = 0, then we will meet our specified
criteria. Solving these equations, we get
(1/16){0 -1 0 9 16 9 0 -1 0} to be the
filter which will produce distortionless reconstruction
for the space spanned by {1, x, x², x³}. It is of interest
to compare this filter with the samples of the truncated
sinc function at half-pixel spacings:
(1/16){0 -3.4 0 10.2 16 10.2 0 -3.4 0}. The 2-D filter
corresponding to the separable product of our 1-D
interpolation filter is shown in Fig. 4.
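As a check on the design above, the two constraints can be solved numerically and the resulting taps compared with the sampled sinc. This is a minimal sketch assuming NumPy; the helper name `midpoint_estimate` and the test cubic are illustrative and not from the paper.

```python
import numpy as np

# Unknown taps: a at +/-1.5 pixels and b at +/-0.5 pixels (symmetric filter).
# Constants are reproduced at the midpoint:  2a + 2b = 1
# x^2 is reproduced at the midpoint (= 0):   2((9/4)a + (1/4)b) = 0
A = np.array([[2.0, 2.0],
              [2.0 * 9.0 / 4.0, 2.0 * 1.0 / 4.0]])
a, b = np.linalg.solve(A, np.array([1.0, 0.0]))

# Filter listed at half-pixel spacings from -2 to +2 pixels.
filt = np.array([0.0, a, 0.0, b, 1.0, b, 0.0, a, 0.0])

# Truncated sinc at the same positions; np.sinc(x) is sin(pi x) / (pi x).
positions = np.arange(-4, 5) / 2.0
sinc_taps = np.sinc(positions)

def midpoint_estimate(f, x0):
    """Estimate f(x0 + 0.5) from the samples at x0 - 1, x0, x0 + 1, x0 + 2."""
    samples = f(np.array([x0 - 1.0, x0, x0 + 1.0, x0 + 2.0]))
    return float(np.dot([a, b, b, a], samples))

cubic = lambda x: 2.0 - x + 3.0 * x ** 2 - 0.5 * x ** 3
```

Multiplying `filt` by 16 recovers the integer taps quoted in the text, and `midpoint_estimate(cubic, x0)` agrees with `cubic(x0 + 0.5)` for any cubic, which is the distortionless property the design asks for.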

As with any image-operator, we are interested in
the filter's noise characteristics. Let us assume, as is
generally done to keep the analysis simple, that the
noise is independently, identically distributed (i.i.d.)
zero-mean Gaussian with variance σ²_noise. Then the
variance of the noise in the interpolated intensities
will be 0.41σ²_noise when the interpolated pixel has 4
nearest neighbors on the original grid and 0.64σ²_noise
when it has 2 nearest neighbors on the original grid.
The noise in the interpolated image, of course, is neither
identically nor independently distributed any
more. Further, we should keep in mind that systematic
errors in the reconstruction are inevitable
when the sampled surface within the interpolation
window does not belong to the space spanned by
{1, x, x², x³} ⊗ {1, y, y², y³} (see Appendix). In contrast
with our finite-extent interpolation filter, the
infinite-extent sinc filter has an associated noise
variance of σ²_noise (see [2]). However, it reconstructs
all adequately bandlimited signals without systematic
errors.
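The two variance factors follow from the sum of squared filter weights, since the variance of a weighted sum of i.i.d. noise samples is the noise variance times the sum of the squared weights. A short sketch of that arithmetic, assuming NumPy (the variable names are illustrative):

```python
import numpy as np

# Weights applied to the four original samples around a half-pixel point.
taps = np.array([-1.0, 9.0, 9.0, -1.0]) / 16.0

# Interpolated pixel lying between 2 original pixels along a row or column:
# only the 1-D filter is involved.
gain_two_neighbors = float(np.sum(taps ** 2))

# Interpolated pixel at the center of 4 original pixels:
# the separable 2-D weights are the outer product of the 1-D taps.
gain_four_neighbors = float(np.sum(np.outer(taps, taps) ** 2))
```

Here `gain_two_neighbors` is 164/256 and `gain_four_neighbors` is its square, which round to the 0.64 and 0.41 quoted in the text.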

Higher  order  filters  can  be  similarly  designed.  It 
must,  however,  be  borne  in  mind  that  as  the  size  of 
the  filter  increases,  so  does  the  basis  required  to 
closely  approximate  the  surface  within  the  window. 

Before we proceed to the next section, we mention
that there is a large body of literature on function
interpolation and approximation (see [1] and the
references within). Various competing schemes
include spline-fitting and surface modeling by
least-squares-fit. Least-squares surface-fitting has
previously been used for edge-detection (e.g. [7]). However,
this implicitly smooths the image data and consequently,
as will be discussed at length in Section V, is
inappropriate for the detection of line-edges. We have
not investigated the tradeoffs accompanying the various
possible interpolation strategies. Our primary
purpose here is to elucidate the usefulness of image
interpolation, in general, as a method for resolution
improvement of edge-detectors.

IV.  Examples 

We first demonstrate the proposed algorithm on
1-D line-edges with varying widths. Figs. 5-a, -b and
-c exhibit the unblurred profile (thin line), the blurred
profile (thick line), the original samples (circled dots)
which are available for processing, and the interpolated
values (uncircled dots) at half-pixel spacings, for
step-edges spaced 3, 2 and 1 pixels apart, respectively.
In all three cases, the step on the left has a high of
192 and a low of 128 while the step on the right has a
high of 192 and a low of 64. The positions of the
steps are indicated directly below them in the figures.
σ_blur for the blurring Gaussian used in all three cases
is 0.6 pixel. The estimated step-edge parameters are
shown alongside the unblurred steps. The step-edge
detector outlined in [9] has been used here. As is evident
from the figures, the biases in the position and
intensity estimates increase as the spacing between
the two step-edges decreases for a given σ_blur. It is
apparent from the blurred profiles that this trend in
the biases is not particular to the chosen edge-detector.

We now illustrate the improvement in performance
accompanying the application of the proposed
technique to “real” images. Fig. 6-a is the original
(128 x 128) image of a bin of parts. Fig. 6-b is the
corresponding edge-detector output adapted from [9].
Fig. 6-c is the result of applying the same edge-detector
with identical parameters to the interpolated
image. It is clear that interpolation onto a finer grid
results in a marked improvement in the resolution.
Many of the detected edges, e.g. those corresponding
to the pins in the top-right of the image, are definitely
less than 5σ_blur apart (σ_blur was estimated to be 0.6



[Fig. 5 panels: unblurred profile, blurred profile, original samples
and interpolated values, with the estimated step parameters.]

Fig. 5-c. Detection and estimation of step-edges
spaced 1 pixel apart (σ_blur = 0.6).

Fig. 6-c. Bin of Parts : Edge Image with Interpolation


Fig. 7-a. Indoor Scene : Original Image (256 x 256)


Fig. 7-b. Indoor Scene : Edge Image without
Interpolation (adapted from [9])


pixel). 

Fig. 7-a is the original (256 x 256) image of an
indoor scene comprising a cup with a pencil inside
it, a telephone, and a book, all placed on a desk. Fig.
7-b is the corresponding edge-detector output adapted
from [9]. Fig. 7-c is the result of applying the same
edge-detector with identical parameters to the interpolated
image. We would like to bring to the reader's
attention the improved resolution in the central portion
of the flower on the cup and in the fringes of the
telephone dial. Notice that the push buttons on the
telephone are less rounded in Fig. 7-c. Also notice the
newly detected edge along the length of the pencil.
Lastly, we point out that Fig. 7-c seems to have more
false positives than Fig. 7-b. σ_blur for this image was
estimated to vary between 0.5 pixel and 1.5 pixel.


Fig. 7-c. Indoor Scene : Edge Image with Interpolation


We conclude with an illustration of the algorithm's
performance on an image of printed text. Fig. 8-a is the
original (192 x 256) image, Fig. 8-b is the corresponding
edge image without prior interpolation and Fig. 8-c is
the edge image with interpolation. The edge-detector
outlined in [9] was employed. σ_blur for this image was
estimated to be ...


V.  Discussion 

We have sought to demonstrate a technique to
improve the resolution capability of any edge-detector.
It involves reconstruction of the underlying
intensity surface prior to step-edge detection.
Although we performed explicit interpolation in the
examples presented, the interpolation step can often
be incorporated into subsequent processing. This is
particularly simple if the first step of the edge-detector
involves the application of a linear space-invariant
operator (such an operator can always be
represented as a convolution mask). Of course, if the
edge operator is already based on implicit interpolation,
e.g. the derivative of an interpolation function,
then the preprocessing may be redundant.
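Folding the interpolation step into a subsequent linear space-invariant operator amounts to convolving the two masks once, ahead of time. The 1-D sketch below illustrates this under stated assumptions: `G` is a hypothetical central-difference first stage chosen only for the example, and it is not the detector of [9].

```python
import numpy as np

# Half-pixel interpolation filter on the fine grid (taps from the text).
H = np.array([-1.0, 0.0, 9.0, 16.0, 9.0, 0.0, -1.0]) / 16.0
# Hypothetical first stage of an edge detector: a central difference.
G = np.array([1.0, 0.0, -1.0]) / 2.0
# Single mask equivalent to interpolation followed by G.
COMBINED = np.convolve(H, G)

def zero_stuff(samples):
    """Place the samples on a grid with half the spacing, zeros in between."""
    z = np.zeros(2 * len(samples) - 1)
    z[0::2] = samples
    return z

def two_step(samples):
    """Interpolate first, then apply the first-stage operator."""
    return np.convolve(np.convolve(zero_stuff(samples), H), G)

def one_step(samples):
    """Apply the precomputed combined mask directly."""
    return np.convolve(zero_stuff(samples), COMBINED)
```

Because convolution is associative, `one_step` and `two_step` return the same values, so the explicit interpolated image never has to be stored.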

Historically, edges in images have been classified
as step-edges, roof-edges and line-edges (see [4]) and it
has been implicit in every discourse on the subject, to
the best of our knowledge, that each of these
categories requires a special purpose detector. We
have demonstrated that a line-edge is simply a pair of
interacting step-edges which can be individually
detected. However, the interaction between step-edges
causes their parameter estimates to be biased:




Fig. 8-a. Text : Original Image (192 x 256)

Fig. 8-b. Text : Edge Image without Interpolation

Fig. 8-c. Text : Edge Image with Interpolation


the more the interaction, the larger are the biases.
First order corrections for these biases can be subsequently
obtained from knowledge of the effective blurring
function associated with the imaging system.

The discrimination of a step-edge from noise in
an image depends on the magnitude of the blurred
step. While this magnitude is the same as that of the
corresponding unblurred step when the edge is isolated,
it may be substantially lower when the step-edge
is interacting, as is evident from Fig. 5. The
contribution of the line-component

$$f(x) = \begin{cases} 0 & x < -w/2 \\ h & -w/2 < x < w/2 \\ 0 & w/2 < x \end{cases}$$

to the height of the blurred step is

$$\int_{-w/2}^{w/2} h \, \frac{1}{\sqrt{2\pi}\,\sigma_{blur}} \exp\!\left(-\frac{x^2}{2\sigma_{blur}^2}\right) dx,$$

assuming a normalized Gaussian blurring function.
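The integral above has a closed form in terms of the error function, h erf(w / (2√2 σ_blur)), which makes the loss of step height easy to evaluate. A small sketch using only the standard library; the function name `line_contribution` is illustrative, and the sample values (h = 64, σ_blur = 0.6, widths of 3, 2 and 1 pixels) are borrowed from the 1-D examples only as plausible inputs.

```python
import math

def line_contribution(h, w, sigma_blur):
    """Height contributed by a line of height h and width w after Gaussian blur.

    Closed form of the integral of h * N(0, sigma_blur^2) over [-w/2, w/2].
    """
    return h * math.erf(w / (2.0 * math.sqrt(2.0) * sigma_blur))

# Fraction of the unblurred height that survives for narrowing lines.
fractions = {w: line_contribution(1.0, w, 0.6) for w in (3.0, 2.0, 1.0)}
```

The surviving fraction falls as the line narrows, which is why closely spaced step-edges are harder to discriminate from noise.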

For a largely bandlimited image, resolution is
determined by the degree of blur. There exists a large
body of work on edge-detection which recommends
implicit or explicit smoothing of the image samples as
the first step. The most commonly used smoothing
function is the 2-D Gaussian (see [11]). It can easily
be shown that smoothing with this function would
result in the smoothed image having an effective
Gaussian blurring function with a variance equal to
the sum of the variance of the original Gaussian blurring
function and that of the Gaussian smoothing
function. Hence, resolution between interacting step-edges
and their detection will both suffer. Also, the
biases in their position and intensity estimates will
increase. The extent of this deterioration will of
course depend on the magnitude of σ_smooth (σ_smooth
typically has a suggested lower bound of 1.0 (see [5])).
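The variance-addition property can be checked numerically by convolving two sampled Gaussians and measuring the spread of the result. This is a sketch assuming NumPy; the function names, grid extent and sample count are arbitrary choices for the illustration.

```python
import math
import numpy as np

def effective_sigma(sigma_blur, sigma_smooth):
    """Standard deviation of the combined blur: variances add under convolution."""
    return math.sqrt(sigma_blur ** 2 + sigma_smooth ** 2)

def measured_sigma(sigma_blur, sigma_smooth, half_width=20.0, n=4001):
    """Convolve two sampled Gaussians and measure the result's standard deviation."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    gauss = lambda s: np.exp(-x ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))
    both = np.convolve(gauss(sigma_blur), gauss(sigma_smooth), mode="same") * dx
    return math.sqrt(np.sum(x ** 2 * both) / np.sum(both))
```

For example, with σ_blur = 0.6 and σ_smooth = 1.0 the effective blur is √1.36, roughly twice the original σ_blur, so the 5σ_blur interaction distance grows accordingly.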

Although the proposed approach to detect line-edges
is more general than any special purpose line-detector,
we should also expect it to be more sensitive
to noise with respect to false positives (particularly at
high-gradient smoothly-shaded regions), false negatives,
and the parameters of the true positives. As is
evident from Fig. 3, improvement in resolution is
necessarily accompanied by a shrinking of the support
used on the interpolated profile for detection. This
results in reduced robustness to local noise-induced
fluctuations in the reconstructed intensity profile.
Thus, we have a trade-off between robustness and
generality. This should come as no surprise considering
that matched-filtering, wherein noisy patterns are
categorized based on their closest “match” to noiseless
representatives of the different classes, is well known
to be optimal (see [6]).

We conclude by noting that if the suggested
interpolation cannot be conveniently incorporated into
subsequent processing, then it is computationally
desirable to exploit the separability of the interpolation
filter, i.e. a 2-D separable filter can be implemented
as two successive 1-D filters in the x and y
directions. It is also advisable to avoid redundant
computations like multiplication by a 1 or a 0, or the
addition of a 0. To get an idea of the resulting savings,
consider the filter shown in Fig. 4. A naive
application of this filter would require approximately
100 operations per interpolated value, while a more
sophisticated implementation would require only 5.
This is easily verified by noting that every interpolated
value can be obtained by the application of the
1-D filter (1/16){0 -1 0 9 16 9 0 -1 0} to the partially
interpolated image.
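One way to realize the separable implementation is sketched below, assuming NumPy. Each half-pixel value is formed as (9(p1 + p2) - (p0 + p3)) / 16, which takes about five arithmetic operations. The function names and the replicated-border handling are assumptions made for the illustration; the paper does not specify how image borders are treated.

```python
import numpy as np

def interpolate_axis0(a):
    """Insert half-pixel samples between consecutive rows of `a`.

    Original rows are copied unchanged; each new row uses the four nearest
    original rows with weights (-1, 9, 9, -1) / 16. Borders are handled by
    replicating the edge rows (an assumption, not taken from the paper).
    """
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    p = np.pad(a, [(1, 1)] + [(0, 0)] * (a.ndim - 1), mode="edge")
    out = np.empty((2 * n - 1,) + a.shape[1:], dtype=float)
    out[0::2] = a
    # Midpoint between rows i and i+1 uses rows i-1, i, i+1, i+2.
    out[1::2] = (9.0 * (p[1:n] + p[2:n + 1]) - (p[0:n - 1] + p[3:n + 2])) / 16.0
    return out

def interpolate_2x(img):
    """Halve the pixel spacing of a 2-D image with two successive 1-D passes."""
    rows_done = interpolate_axis0(img)          # partially interpolated image
    return interpolate_axis0(rows_done.T).T     # second pass along the other axis
```

Away from the border, a surface in the span of {1, x, x², x³} ⊗ {1, y, y², y³} is reproduced exactly by `interpolate_2x`, in line with the Appendix.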




Appendix  :  Interpolation  using  2-D  Separable  Filter 

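We show here that for any sampled surface of the form

$$I(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} c_{ij}\, x^i y^j \sum_{n} \sum_{m} \delta(x-n)\, \delta(y-m)$$

$$\text{where } \delta(k) = \begin{cases} 1 & \text{if } k = 0 \\ 0 & \text{if } k \neq 0 \end{cases}$$

we will get perfect interpolation using the separable
filter r(x)r(y), where r(x) is designed to reconstruct
sampled 1-D signals in the space spanned by {1, x, x²,
x³}. The reconstructed signal can be expressed as

$$\begin{aligned}
R(x, y) &= \iint I(\alpha, \beta)\, r(x-\alpha)\, r(y-\beta)\, d\alpha\, d\beta \\
&= \iint \Big( \sum_{i=0}^{3} \sum_{j=0}^{3} c_{ij}\, \alpha^i \beta^j \sum_{n} \sum_{m} \delta(\alpha-n)\, \delta(\beta-m) \Big)\, r(x-\alpha)\, r(y-\beta)\, d\alpha\, d\beta \\
&= \sum_{i=0}^{3} \sum_{j=0}^{3} c_{ij} \Big[ \int \alpha^i\, r(x-\alpha) \sum_{n} \delta(\alpha-n)\, d\alpha \Big] \Big[ \int \beta^j\, r(y-\beta) \sum_{m} \delta(\beta-m)\, d\beta \Big] \\
&= \sum_{i=0}^{3} \sum_{j=0}^{3} c_{ij} \Big[ \sum_{n} n^i\, r(x-n) \Big] \Big[ \sum_{m} m^j\, r(y-m) \Big] \\
&= \sum_{i=0}^{3} \sum_{j=0}^{3} c_{ij}\, x^i y^j
\end{aligned}$$

as $\sum_{n} n^i\, r(x-n) = x^i$ for i = 0, ..., 3, by design.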

[8] M.H.Hueckel, “A Local Visual Operator which
Recognizes Edges and Lines,” J. ACM, Vol.20,
No.4, Oct. 1973, 634-647.

[9] V.S.Nalwa and T.O.Binford, “On Detecting
Edges,” IEEE Trans. PAMI-8, No.6, Nov. 1986,
699-714.

[10] W.K.Pratt, Digital Image Processing. John
Wiley & Sons, New York, 1978.

[11] V.Torre and T.Poggio, “On Edge Detection,”
IEEE Trans. PAMI-8, No.2, March 1986, 147-163.

Acknowledgement 

The  author  would  like  to  thank  his  advisor,  Tom 
Binford,  for  his  help.  This  work  was  prompted  by 
discussions  with  Gerard  Medioni  and  Jean  Ponce. 


References 

[1]  K.E.Atkinson,  An  Introduction  to  Numerical 
Analysis.  John  Wiley  &  Sons,  New  York,  1978. 

[2]  R.N.Bracewell,  The  Fourier  Transform  and  its 
Applications.  McGraw-Hill,  New  York,  1978. 

[3] H.B.Barlow, “Reconstructing the Visual Image in
Space and Time,” Nature, Vol.279, May 1979,
189-190.

[4] M.Brady, “Computational Approaches to Image
Understanding,” Computing Surveys, Vol.14,
No.1, March 1982, 3-71.

[5] J.F.Canny, “Finding Edges and Lines in Images,”
AI-TR 720, M.I.T. A.I. Lab., June 1983.

[6] J.W.Goodman, Introduction to Fourier Optics.
McGraw-Hill, New York, 1968.

[7] R.M.Haralick, “Digital Step Edges from Zero
Crossing of Second Directional Derivatives,”
IEEE Trans. PAMI-6, No.1, Jan. 1984, 58-68.




Detection,  Localization  and  Estimation  of  Edges1 

J.S.  Chen  and  G.  Medioni 

Institute  for  Robotics  and  Intelligent  Systems 
Departments  of  Electrical  Engineering  and  Computer  Science 
University  of  Southern  California 
Los  Angeles,  California  90089-0273 


Abstract 

We  present  a  method  to  detect,  locate  and  estimate  edges 
in  a  one  dimensional  signal.  It  is  inherently  more  accurate 
than  all  previous  schemes  as  it  explicitly  models  and  cor¬ 
rects  interaction  between  nearby  edges.  The  method  is  it¬ 
erative  with  initial  estimation  of  edges  provided  by  the  zero 
crossings  of  the  signal  convolved  with  a  LoG  (Laplacian  of 
Gaussian) filter. The computations require
knowledge of this convolved output only in a neighborhood
around each zero crossing and, in most cases, could be
performed locally by independent parallel processors. Even
though  the  model  used  for  edges  is  very  crude,  the  results 
are  very  impressive  and  allow  us  to  reconstruct  the  signal 
extremely  well.  We  show  results  on  1-dimensional  slices  ex¬ 
tracted  from  real  images,  and  on  images  which  have  been 
processed  independently  in  the  row  and  column  directions. 
We  provide  an  analysis  of  the  method  including  issues  of 
complexity  and  convergence,  and  outline  directions  of  fu¬ 
ture  research. 

1  Introduction 

In order to achieve image understanding, a biological or
computer system must relate the raw input data
(viewer-centered) to the physical events that cause it, in particular
the  objects  being  observed  (object-centered).  The  first  step 
in  this  complex  sequence  of  operations  is  to  obtain  a  com¬ 
pact  description  of  the  input  image.  At  this  early  stage  of 
processing,  it  is  essential  for  the  resulting  representation 
to  be  complete,  that  is  to  capture  all  of  the  information 
contained in the original image. It should therefore be
possible and easy to reconstruct the original signal from
the representation, as suggested by [Marr79,Marr80]. The
other  necessary  characteristic  of  the  representation  is  that 
it  should  be  meaningful,  that  is  to  make  explicit  the  im¬ 
portant  properties  of  the  imaged  scene.  As  noted  by  many 
authors  [Binford81,Brady82,Marr80],  physical  edges  of  the 
objects  are  fundamental  descriptions  of  these  objects  as 
they  relate  to  transitions  in  surface  orientation  or  texture. 


¹This research was supported by the Defense Advanced Research
Projects Agency under contract number F33615-84-K-1404,
monitored by the Air Force Wright Aeronautical Laboratories, DARPA
Order No. 3119.


These  edges  are  very  likely  to  be  projected  onto  edges  in 
the  intensity  data  received  by  the  sensor.  From  the  short 
analysis  above,  it  therefore  appears  that  the  initial  task  of 
a  vision  system  is  the  detection,  localization  and  charac¬ 
terization  of  the  local  2-dimensional  edges  in  intensity. 

We  begin  by  reviewing  the  major  approaches  taken  by 
previous  researchers  to  accomplish  edge  detection,  local¬ 
ization  and  estimation.  We  then  point  out  that,  in  spite 
of  the  claims  of  optimality,  the  resulting  edges  are  disap¬ 
pointing  because  the  model  used  is  unrealistic,  namely  that 
of a single isolated edge. We then proceed to introduce a
model formulation which explicitly takes into account the
interaction between edges, and develop an iterative scheme
to correct this interaction. After that, we provide an
analysis of the implementation, addressing issues of
complexity, stability, and convergence. The next section illustrates
our method on a few examples both in 1-D and 2-D. For
each image, we also provide a reconstructed picture from
our edge description. Finally, we provide some conclusions
and outline further research directions.

2  Previous  Work 

From the very early days of computer vision, edge
detection was recognized as necessary, with a simple
implementation of a gradient function [Roberts65]. Noise
sensitivity forced the inclusion of a smoothing step before
differentiation [Rosenfeld71]. A more rigorous approach based
on the ideal model of an edge led to the Hueckel operator
with the first claim of optimality. More claims of
optimality have been made by Marr and Hildreth [Marr80],
extending the initial work of Marr and Poggio [Marr79],
and the edges from this detection scheme have become a
standard gauge against which all other methods are
compared. The basic approach is to convolve the signal with a
rotationally symmetric Laplacian of Gaussian mask
(sometimes approximated by a Difference of Gaussians), and to
locate zero-crossings of the convolution. A recent paper
by Torre and Poggio [Torre86] judiciously points out that
better results may be obtained by using 2 directional filters
with directional derivatives, especially in the neighborhood
of corners.




Other approaches have claimed optimality or shown
results subjectively comparable or superior to the
Marr-Hildreth zero-crossing detector. Among them:

1. Dickey and Shanmugam [Dickey77] define an edge
as a step discontinuity between regions of uniform
intensity and show that the ideal filter is given by
a prolate spheroidal wave function. Lunscher and
Beddoes [Lunscher86] show that the Marr-Hildreth
filter is nearly identical, but simpler to study and
implement.

2.  Haralick  [Haralick80,Haralick81,Haralick82]  locates 
edges  at  zero  crossings  of  the  second  directional  deriva¬ 
tive  in  the  direction  of  the  gradient.  Derivatives  are 
computed  by  interpolation  of  the  sampled  intensity 
values.  On  some  images,  the  resulting  edges  are  vi¬ 
sually  better  than  the  ones  from  the  Marr-Hildreth 
detector.  Torre  and  Poggio  [Torre86]  give  an  excel¬ 
lent  analysis  of  the  difference  between  the  Laplacian 
and  the  second  derivative  in  the  direction  of  the  gra¬ 
dient. 

3.  Canny  [Canny83]  shows  that,  in  1-D,  the  optimal 
filter,  according  to  his  criterion,  is  a  linear  combi¬ 
nation  of  four  exponentials,  well  approximated  by  a 
first  derivative  of  a  Gaussian.  In  2-D  images,  he  pro¬ 
poses  to  use  a  combination  of  such  filters  with  vary¬ 
ing  length,  width  and  orientation.  Since  it  is  a  first 
derivative  operator,  it  requires  further  thinning  and 
thresholding.  (Our  implementation  of  it  uses  mod¬ 
ules  from  the  “Linear”  feature  extraction  program 
[Nevatia80].)  The  results  on  real  images  are  quite 
good.  This  can  be  attributed,  in  our  view,  to  the 
fact  that  the  thinnest  mask  possible  is  used,  there¬ 
fore  minimizing  edge  interaction  effects. 

4. Shen and Castan [Shen86] recently proposed an
“optimal” linear filter, in which images are convolved
with the smoothing function $f(x) = -\frac{1}{2}\ln(b)\, b^{|x|}$ prior
to differentiation. They claim better localization than
the Marr-Hildreth zero-crossing detector. On real
images, good edges are obtained after a few extra
filtering steps tuned by the user.

It  is  quite  important  to  note  that,  in  spite  of  these 
claims  of  optimality,  edges  are  quite  disappointing.  Also, 
precisely  because  these  edges  are  poor,  very  few  authors 
attempt  to  reconstruct  the  original  signal. 

It  is  our  belief  that  edge  detection  should  not  crucially 
depend  on  small  differences  in  the  smoothing  function,  and 
also  that  trying  to  design  a  filter  that  will  always  respond 
with  a  zero-crossing  at  the  correct  location  is  as  unreal¬ 
istic  as  trying  to  accomplish  image  understanding  in  one 
step.  The  main  problem  with  all  previous  approaches  is 
that  they  assume  an  unreasonable  model,  namely  an  iso¬ 
lated  edge,  when  we  all  observe  that  edges  often  occur 
close  together. 


The  following  sections  explain  and  illustrate  how  to 
obtain  good  edges  from  an  initial  estimate  by  explicitly 
modelling  edge  interaction. 


3  Description  of  the  Method 

3.1  Informal  Description 

Our purpose is to accurately estimate both the location
and magnitude of edges, initially in a one dimensional
signal. The two sources of error in accurately locating edges
are the interaction between them and the lack of symmetry of
the edge profile, and it is our observation that edge
interaction is the larger of the two. Effects of edge profiles
on localization are studied in another paper [Medioni86].
To  correct  this  interaction,  we  therefore  use  a  very  simple 
model  in  which  edges  are  perfect  steps.  Each  step  is  com¬ 
pletely  specified  by  its  location  and  signed  magnitude.  As 
a  result,  the  convolution  of  the  signal  with  a  LoG  filter 
can  be  expressed  at  any  point  by  a  weighted  sum  of  the 
responses  from  each  discontinuity. 
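
As a hedged illustration of this weighted sum, the sketch below evaluates the modelled LoG response of a train of ideal steps. It assumes the closed form obtained with an unnormalized Gaussian, in which a step of size s at location z contributes s (x - z)/σ² exp(-(x - z)²/2σ²); the function name `log_response` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def log_response(x, edges, sigma):
    """Modelled LoG response at x for ideal steps given as (location, size)."""
    total = 0.0
    for z, s in edges:
        d = x - z
        total += s * d * math.exp(-d * d / (2.0 * sigma * sigma))
    return total / (sigma * sigma)

# An isolated edge produces a zero crossing exactly at its own location.
isolated = log_response(0.0, [(0.0, 100.0)], sigma=2.0)

# A second edge 3 samples away shifts the level seen at the true location,
# so the zero crossing no longer coincides with the edge.
bump = log_response(0.0, [(0.0, 100.0), (3.0, -100.0)], sigma=2.0)
```

The nonzero value of `bump` at the true edge location is the interaction that the method sets out to correct.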

During  the  initialization  phase,  the  number  of  discon¬ 
tinuities  is  given  by  the  number  of  zero  crossings,  the  lo¬ 
cation  by  the  (interpolated)  position  of  the  zero  crossings, 
and  the  magnitude  by  an  estimation  involving  convolution 
of  the  signal  with  the  first  derivative  of  a  Gaussian,  as  ex¬ 
plained  in  detail  in  the  next  subsection. 
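
A minimal sketch of the initial magnitude estimate follows, assuming an unnormalized Gaussian so that an isolated step of size s yields a first-derivative response of about s at the edge. The helper names (`dgauss`, `step_size_estimate`), the 4σ truncation radius and the test signal are assumptions made for illustration only.

```python
import math

def dgauss(u, sigma):
    """First derivative of the unnormalized Gaussian exp(-u^2 / (2 sigma^2))."""
    return -(u / sigma ** 2) * math.exp(-u * u / (2.0 * sigma * sigma))

def step_size_estimate(signal, x, sigma):
    """Discrete convolution of `signal` with dgauss, evaluated at position x."""
    radius = int(math.ceil(4.0 * sigma))           # truncate the kernel at 4 sigma
    lo = max(0, int(math.floor(x)) - radius)
    hi = min(len(signal) - 1, int(math.ceil(x)) + radius)
    return sum(signal[n] * dgauss(x - n, sigma) for n in range(lo, hi + 1))

# One step of +100 located between samples 20 and 21, on a baseline of 50.
signal = [50.0] * 21 + [150.0] * 19
estimate = step_size_estimate(signal, 20.5, sigma=2.0)   # close to 100
```

The constant baseline cancels because the kernel is odd, so only the discontinuity contributes to the estimate.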

The  iteration  proceeds  by  hypothesize  and  verify  :  Sup¬ 
posing  that  the  signal  is  well  approximated  by  the  initial 
train  of  discontinuities,  we  can  synthesize  the  correspond¬ 
ing  convolution  with  a  LoG  mask.  Due  to  the  interaction 
between  nearby  edges,  the  correct  location  of  a  discontinu¬ 
ity  is  no  longer  at  the  zero  crossing  position  but  at  some 
nearby  location  corresponding  to  a  nonzero  level  crossing. 
Once the new locations are computed, we estimate the
corresponding magnitude of the step. This iterative
improvement continues until all values have converged to a
stable position, or until the improvements become
negligible.


It  is  important  to  notice  that  we  only  use  the  values  of 
the  signal  convolved  with  the  LoG  filter  in  a  small  neigh¬ 
borhood  around  each  zero  crossing  to  perform  our  compu¬ 
tation.  It  is  also  interesting  to  note  that,  even  though  our 
model  is  very  crude,  it  estimates  the  edges  so  nicely  that 
the  reconstructed  signal  is  a  very  acceptable  approxima¬ 
tion  of  the  original,  as  shown  in  the  results. 


3.2  Mathematical  Description 


We use a simple model for edges, namely a step edge. Let
the jth step edge, at location $z_j$ with step size $s_j$, be
denoted by $I_j$:

$$I_j(x) = \begin{cases} c_j & x < z_j \\ c_j + s_j & x > z_j \end{cases} \qquad (1)$$

with the additional constraint that $c_{j+1} = c_j + s_j$.

And we assume that the signal can be well approximated
by the function $I(x)$ which is the sum of all edges:

$$I(x) = \sum_{j=1}^{n} I_j(x) \qquad (2)$$

The second derivative of a Gaussian is:

$$L(x) = \frac{1}{\sigma^2} \left( 1 - \frac{x^2}{\sigma^2} \right) e^{-\frac{x^2}{2\sigma^2}} \qquad (3)$$

Therefore the convolution of the approximating signal
and of a LoG filter is:

$$F(x) = I(x) * L(x) = \frac{1}{\sigma^2} \sum_{j=1}^{n} s_j (x - z_j)\, e^{-\frac{(x - z_j)^2}{2\sigma^2}} \qquad (4)$$

Each term in the sum above represents the convolution
of a single isolated step edge with a LoG filter. It is
exactly the first derivative of a Gaussian function, in which
the zero crossing is located at $z_j$, the peak and valley are
at a distance of $\sigma$ on either side of $z_j$, and the amplitude
is proportional to $s_j$, the size of the step. If adjacent edges
are far apart, then the contribution at any point is due to
a single edge, since the exponential terms corresponding
to the other edges are extremely small.

In particular, $F(z_k) = 0$, and the position of each zero
crossing accurately reflects the location of the edge. Most
of the time, however, edges will interact, and the value at
the true edge location $z_k$ is no longer 0, since

$$F(z_k) = \frac{1}{\sigma^2} \sum_{j=1}^{n} s_j (z_k - z_j)\, e^{-\frac{(z_k - z_j)^2}{2\sigma^2}} \qquad (5)$$

The above equations naturally suggest an iterative scheme
to obtain the true location of edges if we have a good
initial estimate of them:

Given an estimate $z^{(t)}$ of our edges at iteration $t$, we
can find the new and more accurate location $z_k^{(t+1)}$ of
the $k$th edge as the point whose value is $F(z_k^{(t)})$.

In order to compute equation (5) above, we also need
an estimate of the step sizes $s_j$. It is tempting to use equation
(4) again to perform this estimation. Our experiments
with such an approach gave rise to unstable results due to
the nature of the equations. Instead, this is accomplished
using a convolution with the first derivative of a Gaussian.

The first derivative of a Gaussian is:

$$D(x) = -\frac{x}{\sigma^2}\, e^{-\frac{x^2}{2\sigma^2}} \qquad (6)$$

The output of the convolution of our train of edges
with this filter is:

$$H(x) = I(x) * D(x) = \sum_{j=1}^{n} s_j\, e^{-\frac{(x - z_j)^2}{2\sigma^2}} \qquad (7)$$

By evaluating this expression at n points, such as the
detected edge points, we obtain a linear system of
equations:

$$H(z_k) = \sum_{j=1}^{n} s_j\, e^{-\frac{(z_k - z_j)^2}{2\sigma^2}}, \qquad k = 1, 2, \ldots, n \qquad (8)$$

In matrix notation, we have:

$$H = M \times S \qquad (9)$$

where,

H is a column vector where $H_k = H(z_k)$,
S is a column vector where $S_j = s_j$, and
M is an n x n matrix with $M_{ij} = e^{-\frac{(z_i - z_j)^2}{2\sigma^2}}$.

Note that M is an extremely well behaved matrix,
symmetric, positive definite, with 1's on the diagonal. For all
practical purposes, it can also be considered as a band
matrix. The inversion of this matrix is quite straightforward,
leading to a set of values for the step sizes.

We can now give the iteration scheme in detail:

1. Initial estimation

At iteration 0, the locations of the edges are given by
the zero crossings of the input signal convolved with
a LoG filter.

2. Iteration formula

At any iteration $t$, S is obtained by computing
$M^{-1} \times H$. This provides an estimate for the step sizes. Then
$z_k^{(t+1)}$ is the point closest to $z_k^{(t)}$ such that

$$F\big(z_k^{(t+1)}\big) = \frac{1}{\sigma^2} \sum_{j=1}^{n} s_j \big(z_k^{(t)} - z_j^{(t)}\big)\, e^{-\frac{(z_k^{(t)} - z_j^{(t)})^2}{2\sigma^2}} \qquad (10)$$

3. Stopping criterion

Since the behavior of the iteration scheme is quite
stable and most of the interaction of edges can be
corrected in the first few steps, the iteration can be
stopped either simply after a fixed number of
iterations, or with some simple error criterion such as
comparing the difference between the actual convolution
output and the modelled result at a few points such
as those level crossing points.

4 Detailed Analysis


4.1  Complexity  of  the  Algorithm 

The algorithm consists of two parts. The first part is the
initial detection of the zero crossings of the convolution
of the signal with the LoG mask. Many fast algorithms
have been proposed for convolution with LoG filters,
see [Chen86]. The zero-crossing detection, usually in
subpixel precision, can be done by various interpolation
schemes, and there is always a trade-off between accuracy
and efficiency. We have tried several interpolation methods
(cubic, quadratic, and linear), and feel that the
improvements due to higher order terms do not warrant the extra
computation time, so we adopted a linear interpolation.
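
The linear interpolation of zero crossings can be sketched as follows. This is a minimal illustration, assuming the LoG response is available as a list of samples on an integer grid; the function name `zero_crossings` and the handling of samples that are exactly zero are assumptions.

```python
def zero_crossings(samples):
    """Subpixel zero crossings of a sampled response, by linear interpolation."""
    crossings = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        if a == 0.0:
            crossings.append(float(i))          # sample lies exactly on zero
        elif a * b < 0.0:
            crossings.append(i + a / (a - b))   # where the chord from a to b hits zero
    return crossings
```

For example, a response that goes from -1 at sample 1 to +1 at sample 2 yields a crossing at 1.5.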

The  second  part  is  the  iteration  scheme.  The  compu¬ 
tation  includes  : 

1. estimation of step sizes by solving matrix equation (9),

2. calculation of the new levels using equation (5),

3. finding the new level-crossings in the neighborhood
of corresponding previous level crossings.
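
The three computations listed above can be sketched as one loop. This is a sketch under stated assumptions rather than the authors' implementation: it uses the step model with an unnormalized Gaussian, takes the measured LoG and first-derivative-of-Gaussian responses as callables (`f_meas`, `h_meas`, hypothetical names standing in for interpolated convolution outputs), solves the system densely for clarity, and finds level crossings by an outward scan followed by bisection.

```python
import math

def gauss(d, sigma):
    return math.exp(-d * d / (2.0 * sigma * sigma))

def solve(m, b):
    """Solve the small dense system m x = b by Gaussian elimination."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n + 1):
                a[r][k] -= f * a[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][k] * x[k] for k in range(r + 1, n))) / a[r][r]
    return x

def step_sizes(h_meas, z, sigma):
    """Step 1: solve H = M S for the step sizes at the current locations."""
    m = [[gauss(zk - zj, sigma) for zj in z] for zk in z]
    return solve(m, [h_meas(zk) for zk in z])

def nearest_crossing(f, level, x0, step=0.01, reach=3.0):
    """Step 3: point closest to x0 where f(x) = level (scan, then bisection)."""
    g = lambda x: f(x) - level
    k = 1
    while k * step <= reach:
        for lo, hi in ((x0 + (k - 1) * step, x0 + k * step),
                       (x0 - k * step, x0 - (k - 1) * step)):
            if g(lo) * g(hi) <= 0.0:
                for _ in range(50):
                    mid = 0.5 * (lo + hi)
                    if g(lo) * g(mid) <= 0.0:
                        hi = mid
                    else:
                        lo = mid
                return 0.5 * (lo + hi)
        k += 1
    return x0  # no crossing within reach: keep the previous estimate

def refine_edges(f_meas, h_meas, z0, sigma, iterations=10):
    z = list(z0)
    for _ in range(iterations):
        s = step_sizes(h_meas, z, sigma)
        # Step 2: level predicted at each edge by the other edges.
        levels = [sum(sj * (zk - zj) * gauss(zk - zj, sigma)
                      for zj, sj in zip(z, s)) / sigma ** 2 for zk in z]
        z = [nearest_crossing(f_meas, lv, zk) for zk, lv in zip(z, levels)]
    return z, step_sizes(h_meas, z, sigma)

# Synthetic check: an ideal bump with edges at 0 and 3, heights +100 and -100.
true_edges, sigma = [(0.0, 100.0), (3.0, -100.0)], 1.5
F = lambda x: sum(s * (x - z) * gauss(x - z, sigma) for z, s in true_edges) / sigma ** 2
H = lambda x: sum(s * gauss(x - z, sigma) for z, s in true_edges)
z_init = [nearest_crossing(F, 0.0, 0.0), nearest_crossing(F, 0.0, 3.0)]
z_final, s_final = refine_edges(F, H, z_init, sigma, iterations=30)
```

On this synthetic bump the initial zero crossings lie outside the true edges, and the loop moves them back toward 0 and 3 while the step sizes approach +100 and -100.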

Calculation  of  a  new  level  for  a  particular  edge  from 
equation  (5)  requires  the  summation  of  the  effect  from  all 
the  edges,  but  for  practical  consideration,  we  only  have  to 
include  the  nearby  edges  since  the  rate  of  exponential  de¬ 
crease  is  very  fast.  Finding  the  new  level  crossing  position 
can  be  done  efficiently  either  by  searching  along  the  curve 
in  the  neighborhood  of  a  previous  location,  or  by  extrapo¬ 
lation.  So  the  major  concern  should  be  on  the  estimation 
of  step  sizes. 

Since the complexity of solving a system of linear equations
is of the order of $n^3$, where n is the number of equations, a
first look at equation (9) seems to indicate that the
complexity is of the order of $n^3$, where n is the number of
edges. But the matrix M is well behaved: it is
symmetric, positive definite, and, for all practical
purposes, since the decrease along both sides of the diagonal is
exponential, it can be approximated by a band matrix.
Experimental results show that the full matrix can be very
well approximated by a band matrix with a bandwidth of 5,
which says that no more than 3 edges interact at any given
point. So the complexity of solving equation (9) by
LU-decomposition is only of the order of n, the number of
edges, instead of $n^3$.
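
A hedged sketch of the band-limited solve: elimination without pivoting (acceptable for a symmetric positive definite matrix) that only touches entries within the band, so the arithmetic grows linearly with the number of edges. A half-bandwidth of 2 corresponds to the bandwidth of 5 quoted above. The function name is an assumption, and the matrix is kept in dense form purely for readability; a real implementation would store only the diagonals.

```python
def solve_banded_spd(m, b, half_bw=2):
    """Solve m x = b, ignoring entries with |i - j| > half_bw."""
    n = len(b)
    a = [row[:] for row in m]
    y = list(b)
    # Forward elimination restricted to the band.
    for c in range(n):
        last = min(n, c + half_bw + 1)
        for r in range(c + 1, last):
            f = a[r][c] / a[c][c]
            for k in range(c, last):
                a[r][k] -= f * a[c][k]
            y[r] -= f * y[c]
    # Back substitution, again restricted to the band.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        last = min(n, r + half_bw + 1)
        acc = y[r] - sum(a[r][k] * x[k] for k in range(r + 1, last))
        x[r] = acc / a[r][r]
    return x
```

Each row operation involves at most `half_bw + 1` columns, which is what brings the cost down from cubic to linear in n.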

It  is  also  worth  mentioning  that  if  one  of  the  off-diagonal 
elements  is  very  small,  then  the  M  matrix  can  be  broken  at 
that point into 2 submatrices which can be inverted
separately. We are also investigating non-exact methods to
invert such a matrix that would run in O(1), i.e. constant
time, on a parallel machine.
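
The splitting into independent submatrices can be sketched as below, assuming sorted edge locations and an arbitrary threshold `eps` on the coupling between neighbouring edges (both the name `independent_groups` and the default threshold are illustrative assumptions).

```python
import math

def independent_groups(z, sigma, eps=1e-6):
    """Split sorted edge locations into groups whose coupling M[i][i+1] < eps."""
    if not z:
        return []
    groups, current = [], [0]
    for i in range(1, len(z)):
        d = z[i] - z[i - 1]
        if math.exp(-d * d / (2.0 * sigma * sigma)) < eps:
            groups.append(current)      # negligible coupling: start a new block
            current = []
        current.append(i)
    groups.append(current)
    return groups
```

Because the locations are sorted, a negligible coupling between two neighbours implies negligible coupling between every pair of edges on opposite sides of the break, so each group can be solved on its own.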


4.2  Stability  and  Convergence 

Experimentally, the algorithm is quite stable, and most
of the edge interaction effect can be corrected in the first
few iterations. Sometimes, however, the iteration for some
edges, usually only one or two, does not converge and
oscillates around the correct edge position. The amplitude of
oscillation is small but not entirely negligible. However,
this oscillation can be easily detected and the correct edge
position can be obtained by averaging values once
oscillation starts.
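
One possible way to detect the oscillation and average it out is sketched below. The text does not specify the detection rule; the period-2 test, the tolerance and the function name are assumptions chosen for illustration.

```python
def settled_position(history, tol=1e-3):
    """Return a final estimate from successive positions of one edge, or None."""
    if len(history) >= 2 and abs(history[-1] - history[-2]) < tol:
        return history[-1]                          # converged
    if (len(history) >= 4
            and abs(history[-1] - history[-3]) < tol
            and abs(history[-2] - history[-4]) < tol):
        return 0.5 * (history[-1] + history[-2])    # period-2 oscillation: average
    return None                                     # keep iterating
```

An edge whose estimates alternate between 1.0 and 1.2 would be reported at their average, 1.1.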

4.3 Limitations of the Method

The first and basic limitation of our method is the spatial
frequency which can be resolved: as the LoG is a band
pass filter, it can only see edges in its bandwidth. This is
illustrated in the example of a “bump”, showing the width
estimation of a bump edge as σ increases: as long as σ is
small enough, the width is correctly computed, and after
that, it grows with σ.

Since differentiation is typically an ill-posed problem
[Torre86], regularization is necessary, in the form of a filtering step.
Consider the example of a bump of limited width, say 1
sampling interval: for different filters, such as LoG filters
with different σ, it has different interpretations. In terms
of the limitation of our method, as the size of the LoG
filter increases, i.e. as σ increases, we lose the high frequency
content of the bump and give a different interpretation,
namely another bump with larger width and smaller height.

The other limitation also comes from the issue of the
filter size: as the size increases, the smoothing effect
increases, and therefore some edges will not be detected at the
initial zero-crossing detection. Our method only tries to
correct those initially detected edges, and we are trying to
interpret the signal at a particular level where the
resolution is decided by the value of σ.

Since we start with a very simple model for an edge,
namely an ideal step edge, there are limitations when the
model does not fit. For example, in the case where a
“sloped” edge is involved such as a ramp, or when there
is a slope difference on the two sides of a step, our model
can no longer fit and just tries to make a best interpretation
with the ideal step edge model. Our experimental results
show that in the case of a symmetric ramp, we will get an
edge in the middle of the ramp, and the step size is
underestimated since the intensity change of a ramp is averaged
over the interval of the ramp. We believe that these other
types of edges can be detected and correctly estimated, as
presented in [Medioni86].

5  Results 

In this section, we show some comparative results from
pure zero-crossing detection of LoG convolution with the
input signal, and from our algorithm. We first show
results on some ideal signals, then we show results on
some 1-dimensional signal slices extracted from a real
2-dimensional image and we also produce the reconstructed
signals. Finally, we demonstrate the behavior of the method
in the presence of noise.




5.1  Ideal  Signals 

We first illustrate our method on an ideal “bump”, which
consists of 2 discontinuities of equal magnitude but
opposite sign, as shown in figure 1a. The width of the bump
is 3 pixels, its magnitude 100 units. Figure 1b shows the
location of the zero crossings as a function of σ, the space
constant of the LoG mask, and it exhibits the expected
behavior that the 2 zero crossings “repel” each other more
and more as σ increases. Figure 1c shows the effects of
our correction algorithm for increasing values of σ. The
detected width is constant up to a certain value, then
increases with σ, due to the limits on the spatial frequency
that each filter can resolve. In figures 1d and 1e, we show a
plot of the difference between the convolved output of the
“true” bump (width = 3, height = 100) and the estimated
one for values ranging between zx - 0.3 and zx + 0.3 for
the edge location, where zx is the true position of the edge,
and between 70 and 130 for the height. For σ = 1.4, there
appears to be a clear “pit” corresponding to the correct
values of width and height of the bump. But for σ = 6.0,
the pit has disappeared and is replaced by a gently
sloping ravine line showing an interpretation of the signal as
a lower and wider bump. Note that it is not easy to relate
these error diagrams directly to our iterative scheme.
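
The “repelling” of the two zero crossings can be reproduced on the step model alone. The sketch below is an illustration under the ideal-step assumption, with the response taken up to a positive factor; the function names and the chosen values of σ are assumptions. It locates the left zero crossing of a bump of width 3 by bisection and reports the apparent width for several σ.

```python
import math

def bump_response(x, width, sigma):
    """LoG response (up to a positive factor) of a unit bump on [0, width]."""
    g = lambda d: d * math.exp(-d * d / (2.0 * sigma * sigma))
    return g(x) - g(x - width)

def apparent_width(width, sigma):
    """Distance between the two outer zero crossings of the bump response."""
    lo, hi = -4.0 * sigma, 0.0      # response is negative at lo, positive at hi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if bump_response(mid, width, sigma) < 0.0:
            lo = mid
        else:
            hi = mid
    left = 0.5 * (lo + hi)
    return width - 2.0 * left       # the response is symmetric about width / 2

widths = [apparent_width(3.0, s) for s in (1.0, 2.0, 4.0)]
```

The apparent width stays close to 3 for small σ and grows as σ increases, which is the uncorrected behavior that the iteration is meant to cancel.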

We now show the other possible 2-edge pattern, which
we call a “staircase”. It is a succession of 2 edges of the
same height and sign separated by a given width, shown in
figure 2a. In some range of σ, this pattern generates 3 zero
crossings, the middle one corresponding to an inflection
point rather than a real edge. We presented a method to
detect such occurrences in a previous paper [Medioni86]:
for inflection points, $f' \times f''' > 0$, where $f'$ and $f'''$ are the
input signal convolved with the first and the third
derivative of a Gaussian respectively. It happens, however, that
some valid edges are discarded by this method when σ is
so large that interaction in the derivative estimates
produces the wrong sign. For that reason, here, we prefer not
to use this test and consider all zero crossings as potential edges.
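
For reference, the sign test can be illustrated on the ideal-step model, where the smoothed derivatives have closed forms. This sketch assumes an unnormalized Gaussian and true (unnegated) derivatives; the function names and the staircase values are illustrative assumptions.

```python
import math

def derivatives(x, edges, sigma):
    """First and third derivatives at x of a step train smoothed by a Gaussian."""
    f1 = f3 = 0.0
    for z, s in edges:
        d = x - z
        g = math.exp(-d * d / (2.0 * sigma * sigma))
        f1 += s * g
        f3 += s * (d * d / sigma ** 4 - 1.0 / sigma ** 2) * g
    return f1, f3

def is_inflection(x, edges, sigma):
    """True when f' * f''' > 0, i.e. the zero crossing is not a real edge."""
    f1, f3 = derivatives(x, edges, sigma)
    return f1 * f3 > 0.0

staircase = [(0.0, 50.0), (6.0, 50.0)]
middle = is_inflection(3.0, staircase, sigma=2.0)   # zero crossing between the steps
edge = is_inflection(0.0, staircase, sigma=2.0)     # location of a real step
```

With these values the middle zero crossing is flagged as an inflection while the real step is not.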

In figure 2b we show the initial zero-crossing detection
for different values of σ. We can observe 3
zero-crossings between σ = 2.0 and σ = 7.7; the one in the
middle is exactly the inflection point, i.e. a zero-crossing
which does not correspond to an edge. Also note in figure
2b that before the filter goes beyond the limit of resolving two
edges, the edges start to interact and the zero-crossing
locations become closer to each other and finally merge into
one single edge. Figure 2c shows
the result of our method and the cancellation of the edge
interaction effect. Note that the inflection point is quite
“unstable”; actually the step size of that inflection point
becomes relatively small, and we can remove it by simple
thresholding. In figure 2d we apply a threshold set to a
fraction of one of the step sizes of the staircase, and we get clean and
accurate locations of the two edges before σ goes beyond
the resolution limit (σ = 7.7).


5.2  Real  Images 

Figure 3a shows the original “cube” image of size 375 x
380 x 8 bits. Figure 3b shows the zero-crossings of the
image convolved with a rotationally symmetric LoG filter
with σ = 4.0. In the last 3 figures, figures 3c, 3d, and
3e, the results are derived from first processing the
image row-by-row, which is a 1-dimensional process
to which we can apply our algorithm directly, then processing
the image column-by-column, and finally combining the results
from both directions. In particular, the reconstructed
image in figure 3e is derived by averaging the row-by-row and
column-by-column reconstructed images. Figure 3c shows
the combination of the zero-crossings of the original
image convolved with a 1-D LoG filter with σ = 4.0 in the
horizontal and vertical directions. Applying our algorithm
to refine this edge information, we get the result of figure
3d. Note that there are some significant differences between
figure 3c and figure 3d: in figure 3d, since our algorithm
gets more accurate edge locations, the edges derived from
the two directional processes, namely row-by-row and
column-by-column, coincide with each other almost perfectly except at
the bottom right of the cube, because these edges are
pronounced ramp edges which our model cannot fit.

Figure 4a and figure 5a are two slices of signals
extracted from the “cube” image shown in figure 3a. Figure
4a is taken in the horizontal direction, figure 5a in the
vertical. The reconstructed signals shown in figure 4b and
figure 5b respectively are based on the initial zero-crossings
of the original signals convolved with a LoG filter, and the
step sizes from the original signals convolved with the first
derivative of a Gaussian mask at the corresponding
zero-crossing positions. With a very large σ of 6.0, we can see
that the strong interaction of edges not only displaces the
zero-crossing positions from the exact edge
locations, but also distorts the step sizes. The
reconstructed signals after our iteration scheme are shown in
figure 4c and figure 5c respectively. The edges are those
discontinuities (or “jumps”) of the reconstructed signals
and we can also see the step size associated with each edge.
We also show in figure 4d and figure 5d respectively the
edge locations as the iteration progresses. We can see not
only that the iteration process converges, but also that
most of the edge interaction is corrected in the first few
iterations.
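
Reconstruction from the edge description can be sketched as a running sum of steps. This is a minimal illustration, assuming the edge list holds (location, signed step size) pairs and that a base level for the first segment is known; the function name and the way the base level is supplied are assumptions, since the text does not specify them.

```python
def reconstruct(edges, base, length):
    """Piecewise-constant signal: `base` plus every step located before the sample."""
    signal = []
    for i in range(length):
        signal.append(base + sum(s for z, s in edges if z <= i))
    return signal

# A bump of height 100 between subpixel locations 2.5 and 5.5.
rebuilt = reconstruct([(2.5, 100.0), (5.5, -100.0)], base=10.0, length=8)
```

Each discontinuity of the rebuilt signal sits at an estimated edge location, with a jump equal to the estimated step size.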


5.3  Ideal  edges  in  noise 

In figure 6a, we show a “bump image” of size 200 x 200 x
8 bits with added Gaussian noise. The bump width is 5
pixels and the bump height is 100 units. The variance of
the Gaussian noise is 20 units. In this example, we only
process the image row by row, and each row of the
image is processed independently by a 1-D process. Using
σ = 7.0, the reconstructed image shown in figure 6b is
obtained without iteration. Since we are applying a large σ,
the reconstructed bump width is much wider than the
original one. In figure 6c, we show the reconstructed image
obtained by our method. We can see that the reconstructed
bump width is very consistent with the original bump width
as a result of accurate edge detection.
In figures 7a and 8a, we show two horizontal slices
extracted from figure 6a. The reconstructed signals without
iteration are shown in figures 7b and 8b respectively. The
reconstructed signals by our method are shown in
figures 7c and 8c respectively.

6 Conclusions and Future Research

In this paper we have presented a technique to refine
edge positions after the detection of the zero-crossings of
the original signal convolved with a LoG filter. We also
estimate the magnitude and the sign of the step sizes
modeling the edges, which enables us to approximate and
reconstruct the original signal. The results show
that our algorithm works very well not only on some ideal
signals, but also on real image data. We have shown the
efficiency of our algorithm and outlined the feasible
parallelism. We have also shown that our method is well
behaved in terms of stability and convergence.

In section 4.3 we discussed the limitations of our method,
including the issue of the over-simplified model and the
issue of resolution. Even though we get very good results
using a very simple edge model, namely the ideal step edge
model, we believe better results can be obtained through the
use of a more general model. We also think that there exist
faster and more efficient ways to solve the system of
equations and to improve the convergence rate. Combining the
information from different scales, i.e. using different σ's,
is another feasible solution for compensating the defect of
our model and, furthermore, for resolving the problem of
resolution.

Most important of all, to make our method really
useful, we must generalize the algorithm to 2 dimensions. On
the subject of multiple scales processing, we believe that
our method provides a simple mechanism to establish
correspondences between different scales, as the positions
of edges are not affected by the value of σ, unless there is
a change of interpretation.

We have shown, in the previous section, good results
of the processing of an image by using 2 one-dimensional
filters, and this might be a feasible approach. We also
intend to establish the correct formulation in 2-D, even
though the solution might involve more complex methods.


References 

[Binford81]  Binford,  T.O., “Inferring  Surfaces  from  Images” , 
Artificial  Intelligence,  Vol  17,  1981,  pp.  205-244. 

[Brezins84] Brezins, V., “Accuracy of Laplacian Edge
Operators”, Computer Vision, Graphics and Image
Processing, Vol. 27, August 1984, pp. 195-210.

[Brady82] Brady, J.M., “Computational Approaches to
Image Understanding”, Computing Surveys, Vol. 14,
1982, pp. 3-71.

[Canny83] Canny, J.F., “Finding Edges and Lines in
Images”, Tech. Report 720, MIT Artificial Intelligence
Lab., June 1983.

[Chen86] Chen, J.S., Huertas, A., Medioni, G., “Very Fast
Convolution with Laplacian-of-Gaussian Masks”,
Proceedings CVPR-86, Miami Beach, FL, 1986,
pp. 293-298.

[Davis75] Davis, L.S., “Survey of Edge Detection Techniques”,
Computer Graphics and Image Processing, Vol. 4,
April 1975, pp. 248-270.

[Dickey77] Dickey, F.M., Shanmugam, K.S., “Optimal Edge
Detection Filter”, Appl. Opt., Vol. 16, No. 1, Jan.
1977, pp. 145-148.

[Haralick80] Haralick, R.M., “Edge and Region Analysis
for Digital Image Data”, Computer Graphics and Image
Processing, Vol. 12, 1980, pp. 60-73.

[Haralick81] Haralick, R.M., “The Digital Edge”, Conf.
Pattern Recognition Image Processing, Dallas, TX,
1981, pp. 285-294.

[Haralick82] Haralick, R.M., “Zero-crossing of Second
Directional Derivative Edge Operator”, SPIE Proc. Robot
Vision, Arlington, VA, 1982.

[Hildreth83] Hildreth, E., “The Detection of Intensity Changes
by Computer and Biological Vision Systems”,
Computer Vision, Graphics and Image Processing, Vol. 22,
April 1983, pp. 1-27.

[Lunscher86] Lunscher, W.H.H.J., Beddoes, M.P.,
“Optimal Edge Detector Design I & II”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, Vol. PAMI-8,
No. 2, March 1986, pp. 164-187.

[Marr79] Marr, D. and Poggio, T., “A Computational
Theory of Human Stereo Vision”, Proceedings of the Royal
Society of London, B204, 1979, pp. 301-328.

[Marr80] Marr, D. and Hildreth, E., “Theory of Edge
Detection”, Proceedings of the Royal Society of London,
B207, 1980, pp. 187-217.

[Medioni86] Medioni, G., Ulupinar, F., “Edge Detection
Using Zero Crossings of Laplacian of Gaussian Bias
Correction”, To Be Published.

[Nevatia80] Nevatia, R., Babu, K.R., “Linear Feature
Extraction and Description”, Computer Graphics and
Image Processing, Vol. 13, 1980, pp. 257-269.




[Roberts65] Roberts, L.G., “Machine Perception of Three
Dimensional Solids”, Optical and Electro-Optical
Information Processing, MIT Press, 1965, pp. 159-197.

[Rosenfeld71] Rosenfeld, A., Thurston, M., “Edge and Curve
Detection for Visual Scene Analysis”, IEEE Trans. on
Computers, Vol. C-20, 1971, pp. 562-569.


[Shen86] Shen, J., Castan, S., “An Optimal Linear
Operator for Edge Detection”, Proceedings CVPR-86,
Miami Beach, FL, 1986, pp. 109-114.

[Torre86] Torre, V., Poggio, T.A., “On Edge Detection”,
IEEE Trans. on Pattern Analysis and Machine
Intelligence, Vol. PAMI-8, No. 2, March 1986, pp. 147-163.


(a) Original signal

(b) Scale-space before iteration

(c) Scale-space after iteration

(d) Error diagram at σ = 1.4

(e) Error diagram at σ = 6.0


Figure 1: The “bump”




(a) Original signal

(b) Scale-space before iteration

(c) Scale-space after iteration without thresholding

(d) Scale-space after iteration with thresholding


Figure 2: The “staircase”


(a)  Original  image 


(d) Edges obtained from two 1-D passes of our method


(e)  Reconstructed  image  by  our  method 


Figure 3: The “cube”


(a) Original signal

(b) Reconstructed signal before iteration

(d) Edge locations as iteration progresses


Figure 4: A vertical slice of the “cube” image


(a)  Original  image 


(b)  Reconstructed  image  without  iteration 


(c)  Reconstructed  image  by  our  method 


Figure  6:  The  “noisy  bump” 




